<p>Slender Means (http://slendermeans.org/)</p>
<h1>Rejection/Feedback: A play in three acts</h1>
<p>Carl, 2014-05-07</p>
<style>
blockquote {
font-family: 'Source Code Pro', monospace;
margin-left: .5rem;
margin-right: 0;
margin-top: 0rem;
text-indent: 0;
border: none;
}
p {margin-top: .5rem; }
p+p {text-indent: 0;}
blockquote em {
font-family: 'Source Code Pro';
font-style: italic;
}
blockquote strong {
font-family: 'Source Code Pro', monospace;
font-weight: 600;
text-decoration: underline;
}
.e-content blockquote ul > li:before {
color: #222;
content: '-';
font-family: 'Source Code Pro', monospace;
}
.e-content blockquote ul > li {
margin-left: 0.5rem;
}
</style>
<h2>Act 1</h2>
<blockquote>
<p>Dear <span class="caps">XXXX</span>,</p>
<p>Thank you for coming in to interview with the team last week. Everyone
enjoyed speaking with you, but unfortunately it was decided that your
background and experience is not an ideal fit for the position. Please be
assured that this decision was arrived at after careful and thorough deliberation.</p>
<p>Again, we appreciate your time, and wish you the best of success in your job search.</p>
<p>Sincerely,</p>
<p>Bea Pearson-Handler<br>
Recruiting<br>
FourPaws.com (<em>FourSquare for your Dog!</em>)</p>
</blockquote>
<h2>Act 2</h2>
<blockquote>
<p>Hi Bea,</p>
<p>Thank you for the e-mail. I enjoyed meeting with everyone and I’m very sorry
to hear the news.</p>
<p>If it would be at all possible, it would be very helpful to me if you could
provide me with any feedback regarding my interview or the hiring decision.
Really, anything at all would be helpful and I would very much appreciate it.</p>
<p>Thanks to you and everyone on the team for your time.</p>
<p>Best,<br>
<span class="caps">XXXX</span></p>
</blockquote>
<h2>Act 3</h2>
<blockquote>
<p>Dear <span class="caps">XXXX</span>,</p>
<p>I’ve included below some notes from the interviewers. Hopefully these will
be helpful to you. Again, best of luck in your job search.</p>
<p>Bea Pearson-Handler<br>
Recruiting<br>
FourPaws.com (<em>FourSquare for your Dog!</em>)
<br>
<br></p>
<p><strong>9:30-10:<span class="caps">30AM</span> “Big Picture” Interview:</strong></p>
<ul>
<li>
<p>Candidate arrived 15 minutes late; claimed he was stuck behind “an Asian
driver” on his way to the office.</p>
</li>
<li>
<p>Questionable personal hygiene. </p>
</li>
<li>
<p>When asked why he wanted to work at FourPaws, candidate referred to his “20—actually I guess it’s 30—grand” gambling debt, and his bookie, “Jimmy Two-Thumbs.”</p>
</li>
<li>
<p>When asked where he saw himself in five years, candidate replied “Working at Google.”</p>
</li>
</ul>
</blockquote>
<p><br></p>
<blockquote>
<p><strong>10:<span class="caps">45AM</span>-12:<span class="caps">00PM</span> Case Study Interview:</strong></p>
<ul>
<li>
<p>Candidate was asked how he might improve engagement of female users with certain features of the site. His answer began with “Chicks, am I right?”</p>
</li>
<li>
<p>When asked why manhole covers were round, candidate replied “Because round is the shape of the Platonic ideal of a hole,” then hedged with “Or, so that fat Con Ed guys can get through them.”</p>
</li>
<li>
<p>Candidate was asked to estimate the number of piano tuners in Chicago. Candidate responded, “Well, you drop one piano tuner from the middle floor of the building. Then you take the second one and light him on fire at both ends. Wait, what was the question?”</p>
</li>
</ul>
</blockquote>
<p><br></p>
<blockquote>
<p><strong>12:<span class="caps">00PM</span> - 1:<span class="caps">00PM</span> “Casual” Lunch Interview:</strong></p>
<ul>
<li>
<p>Candidate served himself one pork sausage, then generous servings of
everything at the Vegan table.</p>
</li>
<li>
<p>Chewed with mouth open.</p>
</li>
<li>
<p>When asked what he does in his spare time, candidate said that he was a prolific author of “<span class="caps">HGTV</span> slash fiction, especially the <a href="http://www.hgtv.com/property-brothers/show/index.html"><em>Property Brothers</em></a>.”</p>
</li>
<li>
<p>Asked if soda was free at the firm. When told yes, asked if we had <span class="caps">RC</span> Cola. When told no, asked if we would special order it if he were hired. Seemed strangely adamant on this point; mentioned he would “bring it up in negotiations.”</p>
</li>
</ul>
</blockquote>
<p><br></p>
<blockquote>
<p><strong>1:00-2:<span class="caps">00PM</span> Technical Interview:</strong></p>
<ul>
<li>
<p>Candidate was given the option to use whiteboard or paper to sketch out code. Candidate said he preferred paper; removed quill pen and inkwell from his shoulder bag.</p>
</li>
<li>
<p>Candidate was asked to code FizzBuzz. Candidate replied that it was a trick question, since “3 and 5 are prime, so nothing is divisible by them except one.”</p>
</li>
<li>
<p>Candidate used Bubble Sort to sort an array. When asked if there was a more efficient algorithm, he said yes, but claimed he didn’t want to use it, “because [his] religion forbids it.” </p>
</li>
</ul>
</blockquote>
<p><br></p>
<blockquote>
<p><strong>2:00-2:15 <span class="caps">HR</span> Wrapup:</strong></p>
<ul>
<li>Candidate asked if he could be reimbursed for travel expenses, since he drives a 1984 Buick Skylark, “for irony.”</li>
<li>Candidate asked if signing bonuses were typically offered, since he had “a good feeling about the Broncos this year.”</li>
<li>Candidate asked if “that girl at reception, you know, with the shirt,” was single.</li>
</ul>
</blockquote>
<p><center>☙ <em>Fin</em> ❧</center></p>
<h1>A Geneva Convention for the Language Wars</h1>
<p>Carl, 2014-01-17</p>
<p>I don’t tend to get too sniffy about the quality of discourse on the
Internet. I have some appreciation for even the most pointless, uninformed
flamewars. (And maybe my take on Web site comments is for another post.) But
there’s an increasingly popular topic of articles and blog posts which is starting to annoy
me a little. You’ve likely read them—they have titles like: “Python is Eating R’s Lunch,” “Why
Python is Going to Take Over Data Science,” “Why Python is a Pain in the Ass and
Will Never Beat R,” “Why Everyone Will Live on the Moon and Code in Julia in 5
Years,” etc.</p>
<p>This style of article obviously isn’t unique to data analysis
languages. It’s a classic nerd flamewar, in the proud tradition of text editor wars
and browser wars. Perhaps an added inflammatory agent here is the Data Science hype machine.</p>
<p>And that’s all okay. Go on the Internet and bitch about languages you don’t like, or tell
everyone why your preferred one is awesome. That’s what
<a href="http://news.ycombinator.com" title="Hacker News">the Internet’s here for</a>. And Lord knows I’ve done it myself.</p>
<p>My only problem is that it distracts from more important, more
interesting conversations about what’s happening with data analysis
languages. Instead of pissing matches and popularity contests, the really
interesting phenomenon is how developers are comparing notes, sharing cool
innovations, and increasing interoperability. A great example is the IPython
notebook. The notebook doesn’t make Python better than other languages—it makes
<a href="http://nbviewer.ipython.org/url/beowulf.csail.mit.edu/18.337/fractals.ipynb" title="IJulia notebook">all</a> <a href="http://gibiansky.github.io/IHaskell/demo.html" title="IHaskell Notebook">languages</a> <a href="http://nbviewer.ipython.org/github/minad/iruby/blob/master/IRuby-Example.ipynb" title="IRuby notebook">better</a>.</p>
<p>I think it’s a really fascinating time for folks who use and think about
computer languages. The last 5 years or so have seen not only the introduction of
really cool new languages, but also extraordinary developments in existing
ones. I’m psyched about all these languages and I want them all to succeed and
get better. Some days I want to code in R, some days in Python. Others in Julia,
or Clojure, or F#, or even C+<span style="margin-left:-4px">+</span>. I don’t want any of them to stagnate or
disappear, or be “beaten” by any of the others. And I don’t think that’s
happening anyway.</p>
<p>So what’s below is a somewhat tongue-in-cheek list of suggestions for
facilitating productive and interesting discussions comparing languages. Many of them are not specific
to our little R/Python/Matlab/Julia skirmishes, but apply to lots of different
language wars (C+<span style="margin-left:-4px">+</span> vs. Java, Python vs. Perl, Ruby vs. Python, Clojure vs. Scala, Haskell vs.,
I dunno, everybody?). The last section is comprised of a couple of very general
notes about civility. I’m strongly in favor of everyone’s right to be a smug prick
on the Internet. But, you know, you should probably try not to be a smug prick
on the Internet.</p>
<p>And, please, feel free to add suggestions in the comments, or in this <a href="https://gist.github.com/carljv/8554723" title="Gist">Gist</a>.</p>
<h2>Section 1: Being Aware of Context</h2>
<h3>§ 1, Article 1</h3>
<p>Recognize that languages have goals and communities. It helps to evaluate them in
that context. Features that are high priority to you may not be high priority to
the majority of users in that language, and vice-versa.</p>
<h3>§ 1, Article 2</h3>
<p>Recognize that many smart, capable people are very productive in the language
you’re slagging. The cool things science and industry are making in the language
speak far louder than your casual dismissals of it on a message board.</p>
<p>The same logic goes for language developers. For example,
<a href="http://had.co.nz" title="Hadley Wickham">Hadley Wickham</a> is a smart guy and a great programmer; he’s probably not one to
waste his time improving a language that’s some irreparably broken
dead-end. Same with <a href="http://continuum.io" title="Continuum">these guys</a>.</p>
<h3>§ 1, Article 3</h3>
<p>Recognize that language design is the art of the tradeoff. Don’t complain about
a design choice until you understand the logic behind it. In many cases, your
preferred design or feature was already considered, and would have led to
undesirable outcomes elsewhere. It is helpful and
interesting to disagree about how a tradeoff was managed, but do recognize that there
was one.</p>
<h3>§ 1, Article 4</h3>
<p>Distinguish between a feature request and a language critique. If you come to a
new language and miss some features of
your old language, that’s fine. But that’s not necessarily a failing of the new language.</p>
<p>A living, breathing language is a combination of both its features and its idioms. A
feature may not exist because its programmers tend to write code in a way that
obviates its need. Sometimes such idioms are crutches to compensate for truly
useful features that are missing; other times they are interesting and elegant expressions of a
problem that you’re just not accustomed to. Try to spot the difference.</p>
<h3>§ 1, Article 5</h3>
<p>Pay your dues before dismissing a language. If you gave up on something in a language after finding it too difficult,
consider that the problem may be yours. It may not be, but at least
consider it.</p>
<h3>§ 1, Article 6</h3>
<p>Don’t over-sell immature, alpha-version features, no matter how
promising they are. Promises don’t cook rice. Sending unsuspecting users to
buggy, incomplete libraries just harms your cause in the long run.</p>
<p>Examples:</p>
<ul>
<li><span class="dquo">“</span>Julia has a fast-growing library of packages!” Sure, but fewer than a handful
are close to production quality.</li>
<li><span class="dquo">“</span>And now ggplot has been ported to Python!” <a href="http://blog.yhathq.com/posts/ggplot-for-python.html" title="ggplot for Python">Not quite yet.</a></li>
</ul>
<p>Honest advertising of works-in-progress is encouraged, though. There’s nothing
inherently wrong with immature libraries, many of which are fantastic.</p>
<h3>§ 1, Article 7</h3>
<p>Microbenchmarks are useful for understanding differences between languages and their
execution, but are
of limited use in pissing contests. No one knows exactly what percentage of the world’s
working software is comprised of Fibonacci number calculations, but our best
guess is not much. </p>
<h2>Section 2: Being Interesting</h2>
<h3>§ 2, Article 1</h3>
<p>Whether one language is going to take over another is not that
interesting, nor that meaningful. (When does a language get “taken over?” For
Christ’s sakes, there’s still a non-trivial amount of <span class="caps">COBOL</span> running out there in the wild.)</p>
<p>Competition is pointless, but comparison is not. Languages are increasingly adopting ideas from
each other, building interops with each other, and sharing
tooling. Having conversations about this process is far more interesting than
running popularity contests.</p>
<h3>§ 2, Article 2</h3>
<p>Avoid clichéd arguments. They are not necessarily inaccurate, but
they are boring.</p>
<p>Examples:</p>
<ul>
<li>R is a “<span class="caps">DSL</span>” or “not a real language” (see Article 3 below); R is “designed by statisticians,
not computer scientists.”</li>
<li><span class="dquo">“</span>Semantic whitespace in Python sucks.” (Generally, arguments over
syntax are boring.)</li>
<li><span class="dquo">“</span>Julia doesn’t have as many libraries as ${pretty much anything}.”</li>
</ul>
<p>In addition to arguments, also avoid clichéd phrases. (See, <em>e.g.</em>, “not ready
for prime-time.”)</p>
<h3>§ 2, Article 3</h3>
<p>Supplement abstract terms or subjective impressions with concrete definitions
and examples.</p>
<p>Examples of statements that could use concrete support:</p>
<ol>
<li><span class="dquo">“</span>Code in language X is <em>more expressive</em> than language Y.”</li>
<li><span class="dquo">“</span>R is a <em><span class="caps">DSL</span></em>, while Python is a <em>general purpose language.</em>”</li>
</ol>
<h2>Section 3: Being Civil</h2>
<h3>§ 3, Article 1</h3>
<p>Be sure that you can accurately summarize someone’s argument before you start
composing your rebuttal.</p>
<h3>§ 3, Article 2</h3>
<p>You are not so smart that you are entitled to be smug. Some tips:</p>
<ol>
<li>Nix hyperbolic vocabulary. No one and nothing associated with any language
is “stupid,” “dumb,” “crazy,” “broken,” etc.</li>
<li>Use of the word “fail” is strongly discouraged. Use of it as a noun is
strictly prohibited.</li>
<li>It is no victory—not even a moral one—to find someone
<a href="http://xkcd.com/386/" title="XKCD">wrong on the internet</a>. Don’t treat it as such. Offer a polite
factual correction and allow for the possibility that you’ve misunderstood.</li>
</ol>
<h1>Tricked out iterators in Julia</h1>
<p>Carl, 2014-01-13</p>
<script src="http://d3js.org/d3.v3.min.js" charset="utf-8"></script>
<script src="./scripts/gadfly.js"></script>
<h2>Introduction</h2>
<p>I want to spend some time messing about with iterators in Julia. I think they
not only provide a familiar and useful entry point into Julia’s type system and dispatch
model, they’re also interesting in their own right.<a name="fnm-multdisp" href="#fn-multdisp" class="footnote-mark">1</a> Clever application of iterators can
help to simplify complicated loops, better express their intent, and improve
memory usage.</p>
<p>A word of warning about the code here. Much of it isn’t idiomatic Julia and I wouldn’t
necessarily recommend using this style in a serious project. I also can’t speak
to its performance vis-a-vis more obvious Julian alternatives. In some cases,
the style of the code examples below may help reduce memory usage, but
performance is not my main concern. (This may be the first blogpost about Julia
unconcerned with speed.) Instead, I’m just interested in different ways of
expressing iteration problems.</p>
<p>For anyone who’d like to play along at home, there’s an IJulia notebook of
this material on <a href="https://github.com/carljv/Julia-Iterators/blob/master/iterator_tricks.ipynb">Github</a>, which can be viewed on nbviewer <a href="http://nbviewer.ipython.org/urls/raw2.github.com/carljv/Julia-Iterators/master/iterator_tricks.ipynb?create=1">here</a>.</p>
<h2>The Iterator Protocol</h2>
<p>What do I mean by iterators?<a name="fnm-iterable" href="#fn-iterable"
class="footnote-mark">2</a> I mean any <code>I</code> in Julia that works on
the right hand side of the statement <code>for i = I ...</code>. That is, anything you can for-loop
over. This includes not only data collections like Arrays, Dicts, and Sets, but
also more abstract types like Ranges, as well as what I’ll call “higher order”
iterators such as those that result from <code>zip</code> or <code>enumerate</code> functions.</p>
<p>As an equivalent definition, an iterator in Julia is any type that implements
the <strong>iterator protocol</strong>. The iterator protocol is comprised of three methods:
<code>start</code>, <code>next</code>, and <code>done</code>. So any type in Julia for which these three methods
are defined is an iterator. It might be a dumb iterator or a broken iterator,
but it’s an iterator. </p>
<p>Since the <code>for</code> statement works on iterators, and iterators are just a
collection of methods, we can define any for loop using calls to those methods. </p>
<p>For example, this simple for loop</p>
<div class="highlight"><pre><span class="n">arr</span> <span class="o">=</span> <span class="p">[</span><span class="mi">10</span><span class="p">:</span><span class="o">-</span><span class="mi">2</span><span class="p">:</span><span class="mi">1</span><span class="p">]</span>
<span class="k">for</span> <span class="n">i</span> <span class="o">=</span> <span class="n">arr</span>
<span class="n">println</span><span class="p">(</span><span class="n">i</span><span class="o">^</span><span class="mi">2</span><span class="p">)</span>
<span class="k">end</span>
</pre></div>
<div class="highlight"><pre>100
64
36
16
4
</pre></div>
<p>is equivalent to this</p>
<div class="highlight"><pre><span class="n">state</span> <span class="o">=</span> <span class="n">start</span><span class="p">(</span><span class="n">arr</span><span class="p">);</span>
<span class="k">while</span> <span class="o">!</span><span class="n">done</span><span class="p">(</span><span class="n">arr</span><span class="p">,</span> <span class="n">state</span><span class="p">)</span>
<span class="n">i</span><span class="p">,</span> <span class="n">state</span> <span class="o">=</span> <span class="n">next</span><span class="p">(</span><span class="n">arr</span><span class="p">,</span> <span class="n">state</span><span class="p">)</span>
<span class="n">println</span><span class="p">(</span><span class="n">i</span><span class="o">^</span><span class="mi">2</span><span class="p">)</span>
<span class="k">end</span>
</pre></div>
<div class="highlight"><pre>100
64
36
16
4
</pre></div>
<p>In this example, the <code>start</code> method provides the initial state of the iterator;
the <code>next</code> method returns the value of the array at a given state, as well as what the
next state is. Finally, the <code>done</code> method returns <code>true</code> when we’ve gone past
the end of the iterator, informing the loop that it should stop.</p>
<p>If you know Python, the idea of the iterator protocol is probably familiar. In
Python, any object can be an iterator if it has the methods <code>__iter__</code> and
<code>__next__</code>. But notice the lack of side effects in the Julia implementation:
calling <code>start</code> or <code>next</code> on the array has no effect on the array itself. <code>start</code> is
basically a constant, always returning the value of the initial state whenever
you pass it the same type of iterator. And <code>next</code> doesn’t really increment
anything; it’s just a mapping from <em>current state → (value, next
state)</em>. In general, the iterator itself has no internal state being incremented or
changed as you pass through a loop.</p>
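<p>To make the contrast with Python concrete, here is a minimal sketch, written in Python rather than Julia, that models the protocol as three pure functions. The names <code>start</code>, <code>next_</code>, <code>done</code>, and <code>for_each</code> are hypothetical helpers made up for this illustration (they are not part of Python or Julia), and states are 0-based here because Python lists are.</p>

```python
# Hypothetical Python model of the start/next/done protocol described above.
# None of these helpers is a real library API; they only mirror the idea.

def start(seq):
    """Initial state for a list: its first index (0-based in Python)."""
    return 0

def next_(seq, state):
    """Pure mapping: current state -> (value, next state)."""
    return seq[state], state + 1

def done(seq, state):
    """True once the state has moved past the last valid index."""
    return state >= len(seq)

def for_each(seq, body):
    """Roughly what a for loop does when written against this protocol."""
    state = start(seq)
    while not done(seq, state):
        value, state = next_(seq, state)
        body(value)

arr = [10, 8, 6, 4, 2]
squares = []
for_each(arr, lambda x: squares.append(x ** 2))

# Asking for the same state twice gives the same answer: nothing was advanced.
repeat_functional = (next_(arr, 0), next_(arr, 0))

# Python's built-in iterator object, by contrast, carries its position with it.
it = iter(arr)
repeat_stateful = (next(it), next(it))
```

<p>In this sketch <code>repeat_functional</code> holds the same pair twice, while <code>repeat_stateful</code> holds two different elements, which is the side-effect difference the paragraph above describes.</p>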
<h2>An iterator’s state</h2>
<p>More concretely, what’s the <em>state</em> of an iterator? How the state is
defined, and what sequence of states an iterator passes through, depend on the type of
the iterator itself. It’s best to look at some examples.</p>
<h3>Arrays</h3>
<p>Arrays are very intuitive iterators. They have states that are just integer
values from 1 to the length of the array. So <code>start</code> returns 1.</p>
<div class="highlight"><pre><span class="n">arr</span> <span class="o">=</span> <span class="p">[</span><span class="s">"one"</span><span class="p">,</span> <span class="s">"two"</span><span class="p">,</span> <span class="s">"three"</span><span class="p">,</span> <span class="s">"four"</span><span class="p">,</span> <span class="s">"five"</span><span class="p">,</span> <span class="s">"six"</span><span class="p">]</span>
<span class="n">start</span><span class="p">(</span><span class="n">arr</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre>1
</pre></div>
<p>The <code>next</code> mapping is <em>i</em> → <em>i+1</em>, and the value of the iterator at any state
<code>i</code> is just <code>arr[i]</code>. </p>
<div class="highlight"><pre><span class="k">for</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">1</span><span class="p">:</span><span class="mi">4</span>
<span class="n">println</span><span class="p">(</span><span class="s">"next(arr, </span><span class="si">$</span><span class="s">i) = "</span><span class="p">,</span> <span class="n">next</span><span class="p">(</span><span class="n">arr</span><span class="p">,</span> <span class="n">i</span><span class="p">))</span>
<span class="k">end</span>
</pre></div>
<div class="highlight"><pre>next(arr, 1) = ("one",2)
next(arr, 2) = ("two",3)
next(arr, 3) = ("three",4)
next(arr, 4) = ("four",5)
</pre></div>
<p>If this were a multidimensional array, say 3×2 instead of 6×1, we’d
get the same result; iteration would just proceed down each column of the matrix in turn, since Julia stores arrays in column-major order.</p>
<p>The <code>done</code> method returns true when the state is <code>i =
length(arr) + 1</code>. You might think it’d be <code>length(arr)</code>, but recall the for-equivalent while loop
above. Having <code>done</code> return true at the last index of the array would prevent
the loop from executing on the last element. So in our 6-element array, <code>done</code>
is true when the state hits 7.</p>
<div class="highlight"><pre><span class="n">println</span><span class="p">(</span><span class="n">done</span><span class="p">(</span><span class="n">arr</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span> <span class="c"># not yet</span>
<span class="n">println</span><span class="p">(</span><span class="n">next</span><span class="p">(</span><span class="n">arr</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>
</pre></div>
<div class="highlight"><pre>false
("six",7)
</pre></div>
<div class="highlight"><pre><span class="n">done</span><span class="p">(</span><span class="n">arr</span><span class="p">,</span> <span class="mi">7</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre>true
</pre></div>
<h3>Ranges</h3>
<p>Ranges have states that look similar to arrays, except they start at zero.</p>
<div class="highlight"><pre><span class="n">rng</span> <span class="o">=</span> <span class="mi">11</span><span class="p">:</span><span class="mi">20</span> <span class="c"># length 10 range</span>
<span class="n">start</span><span class="p">(</span><span class="n">rng</span><span class="p">)</span> <span class="c"># 0</span>
</pre></div>
<div class="highlight"><pre>0
</pre></div>
<p>But the relationship between the current and next state is the same: <em>i</em> → <em>i+1</em>.</p>
<div class="highlight"><pre><span class="k">for</span> <span class="n">i</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">10</span><span class="p">]</span>
<span class="n">println</span><span class="p">(</span><span class="s">"next(rng, </span><span class="si">$</span><span class="s">i) = "</span><span class="p">,</span> <span class="n">next</span><span class="p">(</span><span class="n">rng</span><span class="p">,</span> <span class="n">i</span><span class="p">))</span>
<span class="k">end</span>
</pre></div>
<div class="highlight"><pre>next(rng, 0) = (11,1)
next(rng, 1) = (12,2)
next(rng, 9) = (20,10)
next(rng, 10) = (21,11)
</pre></div>
<p>Since we start at zero, the done state is one less than for the equivalent array: <code>length(rng)</code> rather than <code>length(rng) + 1</code>.</p>
<div class="highlight"><pre><span class="n">done</span><span class="p">(</span><span class="n">rng</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre>true
</pre></div>
<h3>Unordered collections: Dicts, Sets, etc.</h3>
<p>Arrays and ranges have a natural order, so the evolution of state is
straightforward. But what about collections such as dictionaries and sets that have no inherent
order? Like in many languages, such things can be iterated over, but the order
of iteration is not easily predictable.</p>
<p>For example, here’s a dictionary:</p>
<div class="highlight"><pre><span class="n">dictit</span> <span class="o">=</span> <span class="p">{:</span><span class="n">one</span> <span class="o">=></span> <span class="mi">1</span><span class="p">,</span> <span class="p">:</span><span class="n">three</span> <span class="o">=></span> <span class="mi">3</span><span class="p">,</span> <span class="p">:</span><span class="n">five</span> <span class="o">=></span> <span class="mi">5</span><span class="p">,</span> <span class="p">:</span><span class="n">five</span> <span class="o">=></span> <span class="s">"five!"</span><span class="p">}</span>
</pre></div>
<div class="highlight"><pre>{:one=>1,:three=>3,:five=>"five!"}
</pre></div>
<p>The starting state isn’t 0 or 1, as would be natural for an ordered collection.</p>
<div class="highlight"><pre><span class="n">s0</span> <span class="o">=</span> <span class="n">start</span><span class="p">(</span><span class="n">dictit</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre>3
</pre></div>
<p>And while <code>next</code> maps state <em>i</em> to state <em>j</em>, the relationship between <em>i</em> and <em>j</em>
is not obvious. Here, while the first state is 3, the second is 11, and the rest
are similarly weird.</p>
<div class="highlight"><pre><span class="n">_</span><span class="p">,</span> <span class="n">s1</span> <span class="o">=</span> <span class="n">next</span><span class="p">(</span><span class="n">dictit</span><span class="p">,</span> <span class="n">s0</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre>((:one,1),11)
</pre></div>
<div class="highlight"><pre><span class="n">_</span><span class="p">,</span> <span class="n">s2</span> <span class="o">=</span> <span class="n">next</span><span class="p">(</span><span class="n">dictit</span><span class="p">,</span> <span class="n">s1</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre>((:three,3),13)
</pre></div>
<div class="highlight"><pre><span class="n">_</span><span class="p">,</span> <span class="n">s3</span> <span class="o">=</span> <span class="n">next</span><span class="p">(</span><span class="n">dictit</span><span class="p">,</span> <span class="n">s2</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre>((:five,"five!"),17)
</pre></div>
<div class="highlight"><pre><span class="n">done</span><span class="p">(</span><span class="n">dictit</span><span class="p">,</span> <span class="n">s3</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre>true
</pre></div>
<p>The states, you probably and correctly suspect, are tied to the internal
implementation of the dictionary, e.g. how the keys are hashed. So the state
doesn’t follow a predictable 1, 2, 3, … order, and what order of elements we
get when iterating is essentially unpredictable.</p>
<p>But because for loops handle the iterator’s states for us, we rarely if ever have to worry about
the representation of an iterator’s state. The for loop implicitly calls the <code>start</code>,
<code>done</code>, and <code>next</code> methods, which do all this bookkeeping for us.</p>
<h2>Iterators and Delayed Evaluation</h2>
<p>While many iterators are collections of data in memory, like Arrays, Dicts, or
Sets, iterators can also represent abstract collections that aren’t held in memory.</p>
<p>Range is a good example. When we iterate over the range <code>1:10</code>, we get the
sequence 1, 2, 3, …, 10. But in memory, this range is comprised of only two
integers, 1 and 10. The values in between are only evaluated when we’re looping over it.</p>
<p>From <a href="https://github.com/JuliaLang/julia/blob/master/base/range.jl">https://github.com/JuliaLang/julia/blob/master/base/range.jl</a>, here’s how a
Range’s iterator protocol is defined:</p>
<div class="highlight"><pre><span class="n">start</span><span class="p">(</span><span class="n">r</span><span class="p">::</span><span class="n">Ranges</span><span class="p">)</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">next</span><span class="p">{</span><span class="n">T</span><span class="p">}(</span><span class="n">r</span><span class="p">::</span><span class="n">Range</span><span class="p">{</span><span class="n">T</span><span class="p">},</span> <span class="n">i</span><span class="p">)</span> <span class="o">=</span> <span class="p">(</span><span class="n">oftype</span><span class="p">(</span><span class="n">T</span><span class="p">,</span> <span class="n">r</span><span class="o">.</span><span class="n">start</span> <span class="o">+</span> <span class="n">i</span><span class="o">*</span><span class="n">step</span><span class="p">(</span><span class="n">r</span><span class="p">)),</span> <span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span>
<span class="n">next</span><span class="p">{</span><span class="n">T</span><span class="p">}(</span><span class="n">r</span><span class="p">::</span><span class="n">Range1</span><span class="p">{</span><span class="n">T</span><span class="p">},</span> <span class="n">i</span><span class="p">)</span> <span class="o">=</span> <span class="p">(</span><span class="n">oftype</span><span class="p">(</span><span class="n">T</span><span class="p">,</span> <span class="n">r</span><span class="o">.</span><span class="n">start</span> <span class="o">+</span> <span class="n">i</span><span class="p">),</span> <span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span>
<span class="n">done</span><span class="p">(</span><span class="n">r</span><span class="p">::</span><span class="n">Ranges</span><span class="p">,</span> <span class="n">i</span><span class="p">)</span> <span class="o">=</span> <span class="p">(</span><span class="n">length</span><span class="p">(</span><span class="n">r</span><span class="p">)</span> <span class="o"><=</span> <span class="n">i</span><span class="p">)</span>
</pre></div>
<p>Notice that the <code>next</code> method calculates the value of the iterator in state
<code>i</code>. This is different from an Array iterator, which just reads the element
<code>a[i]</code> from memory.</p>
<p>Iterators that exploit delayed evaluation like this can have important performance
benefits. If we want to iterate over the integers 1 to 10,000, iterating over an
Array means we have to allocate about <span class="caps">80KB</span> to hold it (10,000 64-bit integers at 8 bytes each). A Range only requires
16 bytes, the same as the range 1 to 100,000 or 1 to 100,000,000.</p>
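<p>The same idea can be sketched in Python. <code>LazyRange</code> and the helper functions below are hypothetical stand-ins written for this illustration, loosely mirroring the Julia definitions above: the object stores only a first value, a step, and a length, and each value is computed when asked for. The last three lines use Python’s built-in <code>range</code> and <code>sys.getsizeof</code> to compare container sizes; the exact byte counts depend on the interpreter, so none are quoted here.</p>

```python
import sys

class LazyRange:
    """Hypothetical stand-in for a range: stores first value, step and length only."""
    def __init__(self, first, last, step=1):
        self.first = first
        self.step = step
        # Number of values from first to last (inclusive) in increments of step.
        self.length = max(0, (last - first) // step + 1)

def start(r):
    """States are offsets from the first value, so iteration begins at 0."""
    return 0

def next_(r, i):
    """Compute the value at offset i on demand; nothing is read from storage."""
    return r.first + i * r.step, i + 1

def done(r, i):
    """Finished once the offset reaches the length."""
    return r.length <= i

def collect(r):
    """Drive the protocol by hand and gather every value into a list."""
    out = []
    state = start(r)
    while not done(r, state):
        value, state = next_(r, state)
        out.append(value)
    return out

values = collect(LazyRange(11, 20))

# Built-in range objects are lazy too: their size does not grow with the bounds.
small = sys.getsizeof(range(1, 10_001))
large = sys.getsizeof(range(1, 100_000_001))
# A list must hold a reference per element (getsizeof counts only that array).
as_list = sys.getsizeof(list(range(1, 10_001)))
```

<p>Here <code>small</code> and <code>large</code> come out equal, and both are far smaller than <code>as_list</code>, which is the trade the paragraph above describes: a little arithmetic per element in exchange for not holding the whole sequence in memory.</p>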
<h3>Application: Iterating over Fibonacci numbers</h3>
<p>Here’s another example of an iterator which computes values on demand, using the
<code>next</code> method to do the work. <code>fibit(n)</code> is an iterator over the first <code>n</code>
Fibonacci numbers. When the iterator is constructed, it doesn’t calculate all of
these numbers. Instead it waits for its <code>next</code> method to be called, providing
the next Fibonacci number depending on the current one.</p>
<div class="highlight"><pre><span class="c"># Iterator produces the first n Fibonacci numbers</span>
<span class="k">type</span> <span class="n">FibIt</span><span class="p">{</span><span class="n">T</span><span class="o"><:</span><span class="n">Integer</span><span class="p">}</span>
<span class="n">last2</span><span class="p">::(</span><span class="n">T</span><span class="p">,</span> <span class="n">T</span><span class="p">)</span>
<span class="n">n</span><span class="p">::</span><span class="n">Integer</span>
<span class="k">end</span>
<span class="n">fibit</span><span class="p">(</span><span class="n">n</span><span class="p">::</span><span class="n">Integer</span><span class="p">)</span> <span class="o">=</span> <span class="n">FibIt</span><span class="p">((</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">n</span><span class="p">)</span>
<span class="c"># Specify types, e.g. BigInt to prevent overflow.</span>
<span class="n">fibit</span><span class="p">(</span><span class="n">n</span><span class="p">::</span><span class="n">Integer</span><span class="p">,</span> <span class="n">T</span><span class="p">::</span><span class="n">Type</span><span class="p">)</span> <span class="o">=</span> <span class="n">FibIt</span><span class="p">{</span><span class="n">T</span><span class="p">}((</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">n</span><span class="p">)</span>
<span class="n">Base</span><span class="o">.</span><span class="n">start</span><span class="p">(</span><span class="n">fi</span><span class="p">::</span><span class="n">FibIt</span><span class="p">)</span> <span class="o">=</span> <span class="mi">1</span>
<span class="k">function</span><span class="nf"> Base</span><span class="o">.</span><span class="n">next</span><span class="p">(</span><span class="n">fi</span><span class="p">::</span><span class="n">FibIt</span><span class="p">,</span> <span class="n">state</span><span class="p">)</span>
<span class="k">if</span> <span class="n">state</span> <span class="o">==</span> <span class="mi">1</span>
<span class="k">return</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="k">else</span>
<span class="n">fi</span><span class="o">.</span><span class="n">last2</span> <span class="o">=</span> <span class="n">fi</span><span class="o">.</span><span class="n">last2</span><span class="p">[</span><span class="mi">2</span><span class="p">],</span> <span class="n">sum</span><span class="p">(</span><span class="n">fi</span><span class="o">.</span><span class="n">last2</span><span class="p">)</span>
<span class="p">(</span><span class="n">fi</span><span class="o">.</span><span class="n">last2</span><span class="p">[</span><span class="mi">2</span><span class="p">],</span> <span class="n">state</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="n">Base</span><span class="o">.</span><span class="n">done</span><span class="p">(</span><span class="n">fi</span><span class="p">::</span><span class="n">FibIt</span><span class="p">,</span> <span class="n">state</span><span class="p">)</span> <span class="o">=</span> <span class="n">state</span> <span class="o">></span> <span class="n">fi</span><span class="o">.</span><span class="n">n</span>
</pre></div>
<div class="highlight"><pre><span class="k">for</span> <span class="n">i</span> <span class="o">=</span> <span class="n">fibit</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
<span class="n">print</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="s">" "</span><span class="p">)</span>
<span class="k">end</span>
</pre></div>
<div class="highlight"><pre>1 1 2 3 5 8 13 21 34 55
</pre></div>
<h3>Tasks/Co-routines</h3>
<p>This talk of iterators with delayed evaluation may remind Pythonistas of
generators. And Julia has a type that is basically equivalent to Python’s
generators, called Tasks. A Task is constructed by calling the <code>Task()</code>
constructor (or the <code>@task</code> macro) on a function with a <code>produce</code> statement, which is similar to Python’s
<code>yield</code>.</p>
<p>Instead of using the <code>FibIt</code> type above, we could get an equivalent iterator by
defining a Task that produces sequential Fibonacci numbers.</p>
<div class="highlight"><pre><span class="k">function</span><span class="nf"> fibtask</span><span class="p">(</span><span class="n">n</span><span class="p">::</span><span class="n">Integer</span><span class="p">,</span> <span class="n">T</span><span class="p">::</span><span class="n">Type</span><span class="p">)</span>
<span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="p">(</span><span class="n">zero</span><span class="p">(</span><span class="n">T</span><span class="p">),</span> <span class="n">one</span><span class="p">(</span><span class="n">T</span><span class="p">))</span>
<span class="k">function</span><span class="nf"> _it</span><span class="p">()</span>
<span class="k">for</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">1</span><span class="p">:</span><span class="n">n</span>
<span class="n">produce</span><span class="p">(</span><span class="n">b</span><span class="p">)</span>
<span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">a</span><span class="o">+</span><span class="n">b</span><span class="p">)</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="n">Task</span><span class="p">(</span><span class="n">_it</span><span class="p">)</span>
<span class="k">end</span>
<span class="n">fibtask</span><span class="p">(</span><span class="n">n</span><span class="p">::</span><span class="n">Integer</span><span class="p">)</span> <span class="o">=</span> <span class="n">fibtask</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="kt">Int</span><span class="p">)</span>
</pre></div>
<p>Once we’ve made the task, we get iteration for free.</p>
<div class="highlight"><pre><span class="k">for</span> <span class="n">i</span> <span class="k">in</span> <span class="n">fibtask</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
<span class="n">print</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="s">" "</span><span class="p">)</span>
<span class="k">end</span>
</pre></div>
<div class="highlight"><pre>1 1 2 3 5 8 13 21 34 55
</pre></div>
<p>Whether you create an iterator using a type with the iterator protocol, or by
constructing a Task, is up to you. There are pros and cons to each approach. By
defining your iterator as a specific type, you can dispatch lots of other
functions on it. By contrast, <code>fibtask</code> just returns a <code>Task</code>, so
defining methods for it means defining methods for all Tasks, which may be
undesirable or infeasible. On the other hand, Tasks give you iterators with less
code. Below I’ll show an example of an iterator that’s hard to define with the
iterator protocol methods, but easy to define as a Task. And of course, Tasks
are coroutines, and can be used in those contexts.</p>
<h2>Realizing Iterators without loops</h2>
<p>So far, we’ve talked about iterators in the context of for loops. We saw that
<code>for i = I</code> was a construct for calling <code>I</code>’s <code>start</code>, <code>done</code> and <code>next</code>
methods, letting us realize and operate on the values in the iterator.</p>
<p>But there are functions which can take iterators as inputs and implicitly iterate over them
to produce some desired result. This obviates the need for explicit for loops, and can
make for cleaner, more functional code. Some examples follow.</p>
<h3><code>collect</code> and <code>reduce</code></h3>
<p>The <code>collect</code> function takes an iterator input, realizes all its values, and
<em>collects</em> them into an array.</p>
<div class="highlight"><pre><span class="n">collect</span><span class="p">(</span><span class="n">fibit</span><span class="p">(</span><span class="mi">10</span><span class="p">))</span>
</pre></div>
<div class="highlight"><pre>10-element Array{Any,1}:
1
1
2
3
5
8
13
21
34
55
</pre></div>
<p>The <code>reduce</code> function similarly realizes the values of an iterator, but then successively
applies an operator to them to give a scalar result.</p>
<div class="highlight"><pre><span class="n">reduce</span><span class="p">(</span><span class="o">+</span><span class="p">,</span> <span class="n">fibit</span><span class="p">(</span><span class="mi">10</span><span class="p">))</span>
</pre></div>
<div class="highlight"><pre>143
</pre></div>
<p>That reduce operation is equivalent to the <code>sum</code> function called with an
iterator argument.</p>
<div class="highlight"><pre><span class="n">sum</span><span class="p">(</span><span class="n">fibit</span><span class="p">(</span><span class="mi">10</span><span class="p">))</span>
</pre></div>
<div class="highlight"><pre>143
</pre></div>
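<p>To make “successively applies an operator” concrete, here is a minimal sketch of a left fold written as an ordinary loop. <code>foldfrom</code> is a made-up name for illustration, not a Base function, and the real <code>reduce</code> is free to group its operations differently; for an associative operator like <code>+</code> the result is the same.</p>
<div class="highlight"><pre># Hypothetical helper: a hand-written left fold, seeded with an initial value.
function foldfrom(op, init, itr)
    acc = init
    for x in itr            # the for loop drives the iterator for us
        acc = op(acc, x)    # fold each value into the running result
    end
    acc
end

foldfrom(+, 0, [1, 1, 2, 3, 5])   # ((((0+1)+1)+2)+3)+5 == 12
foldfrom(*, 1, 1:5)               # 120
</pre></div>
<p>Any iterator can stand in for the array or range here, since the loop only relies on the iterator protocol.</p>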
<p>In this next line of code, I compute the sum of the reciprocals of the first
10,000 Fibonacci numbers (which should be close to <a href="http://en.wikipedia.org/wiki/Reciprocal_Fibonacci_constant">this</a>), using <code>collect</code> to first put them into an array.</p>
<div class="highlight"><pre><span class="n">sum</span><span class="p">(</span><span class="mi">1</span> <span class="o">./</span> <span class="n">collect</span><span class="p">(</span><span class="n">BigInt</span><span class="p">,</span> <span class="n">fibit</span><span class="p">(</span><span class="mi">10_000</span><span class="p">,</span> <span class="n">BigInt</span><span class="p">)))</span>
</pre></div>
<div class="highlight"><pre>3.359885666243177553172011302918927179688905133731968486495553815325130318996609
e+00 with 256 bits of precision
</pre></div>
<h3>Comprehensions</h3>
<p>The <code>collect</code> function may remind you of an array comprehension, and it is
similar, but here we see that array comprehensions don’t work on our iterator:</p>
<div class="highlight"><pre><span class="p">[</span><span class="n">f</span> <span class="k">for</span> <span class="n">f</span> <span class="o">=</span> <span class="n">fibit</span><span class="p">(</span><span class="mi">10</span><span class="p">)]</span>
</pre></div>
<div class="highlight"><pre>no method length(FibIt{Int64})
while loading In[26], in expression starting on line 1
in anonymous at no file
</pre></div>
<p>What’s going on is that the array comprehension wants to allocate an array,
then fill it in with the values of the iterator. Since it doesn’t know the
iterator’s length (how many values it will produce), it doesn’t know how large
an array to allocate.<a name="fnm-arrcomp" href="#fn-arrcomp"
class="footnote-mark">3</a> We can fix this for our Fibonacci iterator by
giving it a <code>length</code> method.</p>
<div class="highlight"><pre><span class="n">Base</span><span class="o">.</span><span class="n">length</span><span class="p">(</span><span class="n">it</span><span class="p">::</span><span class="n">FibIt</span><span class="p">)</span> <span class="o">=</span> <span class="n">it</span><span class="o">.</span><span class="n">n</span>
<span class="p">[</span><span class="n">f</span> <span class="k">for</span> <span class="n">f</span> <span class="o">=</span> <span class="n">fibit</span><span class="p">(</span><span class="mi">10</span><span class="p">)]</span>
</pre></div>
<div class="highlight"><pre>10-element Array{Int64,1}:
1
1
2
3
5
8
13
21
34
55
</pre></div>
<p>Now we can redefine our sum-of-reciprocals using a comprehension instead of <code>collect</code>.</p>
<div class="highlight"><pre><span class="n">sum</span><span class="p">([</span><span class="mi">1</span><span class="o">/</span><span class="n">f</span> <span class="k">for</span> <span class="n">f</span> <span class="o">=</span> <span class="n">fibit</span><span class="p">(</span><span class="mi">10_000</span><span class="p">,</span> <span class="n">BigInt</span><span class="p">)])</span>
</pre></div>
<div class="highlight"><pre>3.359885666243177553172011302918927179688905133731968486495553815325130318996712
e+00 with 256 bits of precision
</pre></div>
<p>What if we tried this with our Fibonacci task?</p>
<div class="highlight"><pre><span class="p">[</span><span class="n">f</span> <span class="k">for</span> <span class="n">f</span> <span class="o">=</span> <span class="n">fibtask</span><span class="p">(</span><span class="mi">10</span><span class="p">)]</span>
</pre></div>
<div class="highlight"><pre>no method length(Task)
while loading In[27], in expression starting on line 1
in anonymous at no file
</pre></div>
<p>We get the same issue; Tasks don’t have a length method. The advantage of using
the <code>FibIt</code> type is that we can easily define a length method for it. We can
only give our Fibonacci task a length method if we give all Tasks a length method,
which wouldn’t make sense.</p>
<h2>The Iterator Package</h2>
<p>When we calculated the sum of the reciprocals of the Fibonacci numbers above, we had
to realize the values of the Fibonacci iterator before taking the
reciprocals, and then sum a collection of all those values. Alternatively, we could have called sum on an
iterator that produced <em>1/x</em> for each Fibonacci number <em>x</em>.</p>
<p>One way to do this would be to create a new iterator type, called
<code>ReciprocalFibIt</code>, and give it its own <code>start</code>, <code>next</code>, and <code>done</code> methods. But that
feels a little excessive. Wouldn’t it be nicer to be able to construct that iterator from
the Fibonacci iterator we already have? Essentially saying, “hey, I want another
iterator that gives one over the values of that other iterator.”</p>
<p>That would be an example of what I’ll call a <em>higher-order iterator</em>, which is
an iterator constructed from one or more other iterators. <code>zip</code> and <code>enumerate</code>
are common examples.</p>
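<p>As a quick illustration of the idea, both <code>zip</code> and <code>enumerate</code> are available in Base and wrap existing iterators without realizing them first. The inputs below are arbitrary placeholders.</p>
<div class="highlight"><pre># zip walks two iterators in lockstep and yields pairs
for (n, s) in zip(1:3, ["x", "y", "z"])
    println(n, " ", s)
end

# enumerate pairs each value with its 1-based position
for (i, s) in enumerate(["a", "b", "c"])
    println(i, " ", s)
end
</pre></div>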
<p>It turns out Julia has a neat little package of useful higher-order iterators
called, obviously, Iterators. In the rest of this (already very long) post, I’ll
explore some of the things in the package. Pythonistas will notice similarities with
the itertools module in the Standard Library.</p>
<div class="highlight"><pre><span class="k">using</span> <span class="n">Iterators</span>
</pre></div>
<h2>Imap</h2>
<p>The <code>Imap</code> iterator provides us with our wish above: a new iterator that is the
result of applying a function to the values of an existing iterator. In the case of our
reciprocal Fibonacci numbers, that function is <code>x -> 1/x</code>. </p>
<div class="highlight"><pre><span class="n">recipricalfib</span> <span class="o">=</span> <span class="n">imap</span><span class="p">(</span><span class="n">x</span> <span class="o">-></span> <span class="mi">1</span><span class="o">/</span><span class="n">x</span><span class="p">,</span> <span class="n">fibit</span><span class="p">(</span><span class="mi">10_000</span><span class="p">,</span> <span class="n">BigInt</span><span class="p">))</span> <span class="c"># A new iterator, composed</span>
<span class="c"># from a FibIt</span>
<span class="n">psi</span> <span class="o">=</span> <span class="n">sum</span><span class="p">(</span><span class="n">recipricalfib</span><span class="p">)</span> <span class="c"># No collect needed</span>
</pre></div>
<div class="highlight"><pre>3.359885666243177553172011302918927179688905133731968486495553815325130318996609
e+00 with 256 bits of precision
</pre></div>
<p>So <code>reciprocalfib</code> is itself an iterator, whose values are only realized when
it’s passed to the <code>sum</code> function. We didn’t have to allocate any arrays before
calling sum as with the <code>collect</code> and comprehension examples above.</p>
<h2>An IFilter iterator</h2>
<p>Since we have a map-like iterator, why not a filter?<a name="fnm-filter" href="#fn-filter" class="footnote-mark">4</a> How would it work? Given an
iterator that produces values <em>v1</em>, <em>v2</em>, <em>v3</em>, …, the filter iterator would
only produce the values that met some predicate, skipping any that didn’t.</p>
<p>This isn’t implemented in the Iterators package (because <code>Base.filter</code> will
already do this, see <a href="#fn-filter">footnote 4</a>). It’s a neat idea, but it turns
out to be tricky to define in terms of the iterator protocol. It’s easy with a
Task, though.</p>
<div class="highlight"><pre><span class="k">function</span><span class="nf"> ifilter</span><span class="p">(</span><span class="n">f</span><span class="p">::</span><span class="n">Function</span><span class="p">,</span> <span class="n">itr</span><span class="p">)</span>
<span class="k">function</span><span class="nf"> _it</span><span class="p">()</span>
<span class="k">for</span> <span class="n">i</span> <span class="o">=</span> <span class="n">itr</span>
<span class="k">if</span> <span class="n">f</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
<span class="n">produce</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="n">Task</span><span class="p">(</span><span class="n">_it</span><span class="p">)</span>
<span class="k">end</span>
</pre></div>
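<p>For comparison, here is a rough sketch of what a protocol-based filter might look like. The catch is that <code>done</code> can’t tell whether any passing values remain without looking ahead, so the state has to carry the next passing value along with the inner iterator’s state. <code>ProtoFilter</code> and <code>advance</code> are hypothetical names for this sketch only, and it targets the same <code>start</code>/<code>next</code>/<code>done</code> protocol used throughout this post.</p>
<div class="highlight"><pre># Hypothetical type: wraps a predicate and an inner iterator.
immutable ProtoFilter{I}
    pred::Function
    itr::I
end

# Scan forward from inner state s to the next value that passes pred.
# Returns (found, value, inner state after that value).
function advance(f::ProtoFilter, s)
    while !done(f.itr, s)
        v, s = next(f.itr, s)
        if f.pred(v)
            return (true, v, s)
        end
    end
    (false, nothing, s)
end

# The filter's state is the lookahead triple produced by advance.
Base.start(f::ProtoFilter) = advance(f, start(f.itr))
Base.next(f::ProtoFilter, state) = (state[2], advance(f, state[3]))
Base.done(f::ProtoFilter, state) = !state[1]

collect(ProtoFilter(iseven, 1:10))   # the even numbers 2, 4, 6, 8, 10
</pre></div>
<p>The Task-based <code>ifilter</code> above does the same job in far fewer lines, because the coroutine keeps track of where it is for us.</p>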
<h3>Application: A list of primes whose digits sum to a prime</h3>
<p>Here’s an example of it in action. We’ll begin with a Range iterator from 1 to
1,000. I want to list all of the numbers in that range that are (1) prime and
(2) have digits that sum to a prime. </p>
<p>So <code>ifilter</code> takes the predicate test and the original iterator, then produces only
those values from the original iterator that pass the test. Turns out there are
89 such primes between 1 and 1,000.</p>
<div class="highlight"><pre><span class="k">function</span><span class="nf"> funnyprimetest</span><span class="p">(</span><span class="n">n</span><span class="p">::</span><span class="n">Integer</span><span class="p">)</span>
<span class="n">sumdigits</span> <span class="o">=</span> <span class="n">sum</span><span class="p">([</span><span class="n">parseint</span><span class="p">(</span><span class="n">string</span><span class="p">(</span><span class="n">c</span><span class="p">))</span> <span class="k">for</span> <span class="n">c</span> <span class="k">in</span> <span class="n">string</span><span class="p">(</span><span class="n">n</span><span class="p">)])</span>
<span class="n">isprime</span><span class="p">(</span><span class="n">n</span><span class="p">)</span> <span class="o">&</span> <span class="n">isprime</span><span class="p">(</span><span class="n">sumdigits</span><span class="p">)</span>
<span class="k">end</span>
<span class="n">collect</span><span class="p">(</span><span class="n">ifilter</span><span class="p">(</span><span class="n">funnyprimetest</span><span class="p">,</span> <span class="mi">1</span><span class="p">:</span><span class="mi">1000</span><span class="p">))</span>
</pre></div>
<div class="highlight"><pre>89-element Array{Any,1}:
2
3
5
7
11
23
29
41
43
47
61
67
83
⋮
829
863
881
883
887
911
919
937
953
971
977
991
</pre></div>
<h2>Repeat and RepeatForever</h2>
<p>Another surprisingly useful iterator is <code>Repeat</code>, which simply produces an object
some number of times. Here the iterator is just the string “echo!” five times.</p>
<div class="highlight"><pre><span class="k">for</span> <span class="n">i</span> <span class="o">=</span> <span class="n">repeated</span><span class="p">(</span><span class="s">"echo!"</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span>
<span class="n">println</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
<span class="k">end</span>
</pre></div>
<div class="highlight"><pre>echo!
echo!
echo!
echo!
echo!
</pre></div>
<p>If we didn’t provide the second argument, the result would be an iterator that
goes on infinitely, so its for loop would never terminate. Why would you want
that? I’ll show some examples of its use below.</p>
<h3>Extension: Repeating impure functions</h3>
<p>One thing about the <code>Repeat</code> iterator though, is that the object or value it
repeats is fixed at its construction. If you pass it a called function, it will
call that function once in the constructor, and then repeatedly return the
result of that first call. For pure functions, that’s fine. The first call of
<code>sqrt(100)</code> is the same as the second, third, or ten-thousandth call of
<code>sqrt(100)</code>.</p>
<p>If the function is impure, though, we’ll get undesired results.</p>
<div class="highlight"><pre><span class="k">for</span> <span class="n">i</span> <span class="o">=</span> <span class="n">repeated</span><span class="p">(</span><span class="n">rand</span><span class="p">(),</span> <span class="mi">10</span><span class="p">)</span> <span class="n">println</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="k">end</span>
</pre></div>
<div class="highlight"><pre>0.30142748588653046
0.30142748588653046
0.30142748588653046
0.30142748588653046
0.30142748588653046
0.30142748588653046
0.30142748588653046
0.30142748588653046
0.30142748588653046
0.30142748588653046
</pre></div>
<p>Here, the <code>rand</code> function was called once in the constructor, and that result was repeated again
and again. I’d prefer if I could get 10 separate calls to <code>rand</code>. Here’s one way
to get this to work.</p>
<div class="highlight"><pre><span class="n">Base</span><span class="o">.</span><span class="n">next</span><span class="p">(</span><span class="n">it</span><span class="p">::</span><span class="n">Iterators</span><span class="o">.</span><span class="n">Repeat</span><span class="p">{</span><span class="n">Function</span><span class="p">},</span> <span class="n">state</span><span class="p">)</span> <span class="o">=</span> <span class="n">it</span><span class="o">.</span><span class="n">x</span><span class="p">(),</span> <span class="n">state</span> <span class="o">-</span> <span class="mi">1</span>
<span class="n">Base</span><span class="o">.</span><span class="n">next</span><span class="p">(</span><span class="n">it</span><span class="p">::</span><span class="n">Iterators</span><span class="o">.</span><span class="n">RepeatForever</span><span class="p">{</span><span class="n">Function</span><span class="p">},</span> <span class="n">state</span><span class="p">)</span> <span class="o">=</span> <span class="n">it</span><span class="o">.</span><span class="n">x</span><span class="p">(),</span> <span class="n">nothing</span>
<span class="c"># Note the function isn't called in the constructor;</span>
<span class="c"># the `next` function does this.</span>
<span class="k">for</span> <span class="n">i</span> <span class="o">=</span> <span class="n">repeated</span><span class="p">(</span><span class="n">rand</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span> <span class="n">println</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="k">end</span>
</pre></div>
<div class="highlight"><pre>0.6621100826024566
0.755346320113107
0.021395943367805037
0.7304018818932669
0.22941680891855865
0.966762896262876
0.13729437119070198
0.028788242666101915
0.5584434146272579
0.09166900954689794
</pre></div>
<p>What I’ve done is create new <code>next</code> methods for the <code>Repeat</code> and <code>RepeatForever</code>
iterators. When the object of the iterators is a function, the <code>next</code> methods
call the function. By passing the iterator an uncalled function object, I avoid the call
in the constructor, and defer it to the <code>next</code> method.</p>
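<p>Adding methods to types you don’t own affects everyone who uses <code>Repeat</code>, so here is an alternative sketch that gets fresh values without redefining anything: repeat the function object, then let <code>imap</code> call it on each step. It assumes the <code>next</code> methods above have <em>not</em> been loaded; if they have, <code>repeated(rand, 3)</code> already yields numbers and calling them would fail. <code>freshrands</code> is just an illustrative name.</p>
<div class="highlight"><pre>using Iterators   # provides repeated and imap, as used in this post

# repeated(rand, 3) repeats the function object itself;
# imap then calls it once per iteration.
freshrands = imap(f -> f(), repeated(rand, 3))
for r = freshrands
    println(r)
end
</pre></div>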
<h2>Take and Drop</h2>
<p>The <code>Take</code> iterator only iterates over a specified number of values from the start of its input
iterator. It works well in combination with infinite iterators, like <code>RepeatForever</code>.</p>
<div class="highlight"><pre><span class="n">randsforever</span> <span class="o">=</span> <span class="n">repeated</span><span class="p">(</span><span class="n">rand</span><span class="p">)</span>
<span class="p">[</span><span class="n">r</span> <span class="k">for</span> <span class="n">r</span> <span class="o">=</span> <span class="n">take</span><span class="p">(</span><span class="n">randsforever</span><span class="p">,</span> <span class="mi">10</span><span class="p">)]</span>
</pre></div>
<div class="highlight"><pre>10-element Array{Any,1}:
0.719153
0.660597
0.280763
0.54125
0.427029
0.919311
0.165029
0.796911
0.354417
0.678271
</pre></div>
<div class="highlight"><pre><span class="p">[</span><span class="n">r</span> <span class="k">for</span> <span class="n">r</span> <span class="o">=</span> <span class="n">take</span><span class="p">(</span><span class="n">randsforever</span><span class="p">,</span> <span class="mi">20</span><span class="p">)]</span>
</pre></div>
<div class="highlight"><pre>20-element Array{Any,1}:
0.568741
0.614644
0.49445
0.0942616
0.518134
0.126585
0.961748
0.698277
0.0805089
0.32351
0.797422
0.513762
0.601515
0.616174
0.460832
0.813204
0.172391
0.444915
0.732941
0.0550762
</pre></div>
<p>The <code>Drop</code> iterator, on the other hand, <em>ignores</em> a specified number of values from the start of its
input iterator. So, how many values should be printed in this for loop?</p>
<div class="highlight"><pre><span class="k">for</span> <span class="n">i</span> <span class="o">=</span> <span class="n">drop</span><span class="p">(</span><span class="n">take</span><span class="p">(</span><span class="n">randsforever</span><span class="p">,</span> <span class="mi">10_000</span><span class="p">),</span> <span class="mi">9998</span><span class="p">)</span>
<span class="n">println</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
<span class="k">end</span>
</pre></div>
<p>Answer: just the last two, since we take 10,000 random numbers, but drop the first 9,998.</p>
<div class="highlight"><pre>0.26830900957427684
0.5141969888172926
</pre></div>
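<p>With random numbers it’s hard to confirm which values survived. Here is a deterministic sketch of the same composition on a plain range, using the <code>take</code> and <code>drop</code> functions from the Iterators package; <code>first10</code> is an illustrative name.</p>
<div class="highlight"><pre>using Iterators   # provides take and drop, as used in this post

first10 = take(1:100, 10)      # 1 through 10
for i = drop(first10, 8)       # skip 1 through 8
    println(i)                 # prints 9, then 10
end
</pre></div>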
<h3>Extension: TakeWhile and TakeUntil</h3>
<p>In some cases you may not want to take a fixed number of values from an
iterator, but instead you want to take values until some condition is met.</p>
<p>To accomplish this, I’ll create a <code>TakeWhile</code> iterator, which takes values from
its input iterator so long as they pass some test.</p>
<div class="highlight"><pre><span class="k">immutable</span> <span class="n">TakeWhile</span><span class="p">{</span><span class="n">I</span><span class="p">}</span>
<span class="n">xs</span><span class="p">::</span><span class="n">I</span>
<span class="n">cond</span><span class="p">::</span><span class="n">Function</span>
<span class="k">end</span>
<span class="n">takewhile</span><span class="p">(</span><span class="n">xs</span><span class="p">,</span> <span class="n">cond</span><span class="p">)</span> <span class="o">=</span> <span class="n">TakeWhile</span><span class="p">(</span><span class="n">xs</span><span class="p">,</span> <span class="n">cond</span><span class="p">)</span>
<span class="n">Base</span><span class="o">.</span><span class="n">start</span><span class="p">(</span><span class="n">it</span><span class="p">::</span><span class="n">TakeWhile</span><span class="p">)</span> <span class="o">=</span> <span class="n">start</span><span class="p">(</span><span class="n">it</span><span class="o">.</span><span class="n">xs</span><span class="p">)</span>
<span class="n">Base</span><span class="o">.</span><span class="n">next</span><span class="p">(</span><span class="n">it</span><span class="p">::</span><span class="n">TakeWhile</span><span class="p">,</span> <span class="n">state</span><span class="p">)</span> <span class="o">=</span> <span class="n">next</span><span class="p">(</span><span class="n">it</span><span class="o">.</span><span class="n">xs</span><span class="p">,</span> <span class="n">state</span><span class="p">)</span>
<span class="k">function</span><span class="nf"> Base</span><span class="o">.</span><span class="n">done</span><span class="p">(</span><span class="n">it</span><span class="p">::</span><span class="n">TakeWhile</span><span class="p">,</span> <span class="n">state</span><span class="p">)</span>
<span class="c"># Check for exhaustion first, so next is never called past the end of it.xs</span>
<span class="n">done</span><span class="p">(</span><span class="n">it</span><span class="o">.</span><span class="n">xs</span><span class="p">,</span> <span class="n">state</span><span class="p">)</span> <span class="o">||</span> <span class="o">!</span><span class="n">it</span><span class="o">.</span><span class="n">cond</span><span class="p">(</span><span class="n">next</span><span class="p">(</span><span class="n">it</span><span class="o">.</span><span class="n">xs</span><span class="p">,</span> <span class="n">state</span><span class="p">)[</span><span class="mi">1</span><span class="p">])</span>
<span class="k">end</span>
<span class="n">tw</span> <span class="o">=</span> <span class="n">takewhile</span><span class="p">(</span><span class="mi">1</span><span class="p">:</span><span class="mi">10</span><span class="p">,</span> <span class="n">x</span> <span class="o">-></span> <span class="n">x</span><span class="o">^</span><span class="mi">2</span> <span class="o"><</span> <span class="mi">25</span><span class="p">)</span>
<span class="n">collect</span><span class="p">(</span><span class="n">tw</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre>4-element Array{Int64,1}:
1
2
3
4
</pre></div>
<p>Let’s also create a <code>TakeUntil</code> iterator, which takes values until it finds one that
passes the test. So the last value produced by this iterator will pass the test
and all values before that will have failed.</p>
<div class="highlight"><pre><span class="k">immutable</span> <span class="n">TakeUntil</span><span class="p">{</span><span class="n">I</span><span class="p">}</span>
<span class="n">xs</span><span class="p">::</span><span class="n">I</span>
<span class="n">cond</span><span class="p">::</span><span class="n">Function</span>
<span class="k">end</span>
<span class="n">takeuntil</span><span class="p">(</span><span class="n">xs</span><span class="p">,</span> <span class="n">cond</span><span class="p">)</span> <span class="o">=</span> <span class="n">TakeUntil</span><span class="p">(</span><span class="n">xs</span><span class="p">,</span> <span class="n">cond</span><span class="p">)</span>
<span class="n">Base</span><span class="o">.</span><span class="n">start</span><span class="p">(</span><span class="n">it</span><span class="p">::</span><span class="n">TakeUntil</span><span class="p">)</span> <span class="o">=</span> <span class="n">start</span><span class="p">(</span><span class="n">it</span><span class="o">.</span><span class="n">xs</span><span class="p">),</span> <span class="n">false</span>
<span class="k">function</span><span class="nf"> Base</span><span class="o">.</span><span class="n">next</span><span class="p">(</span><span class="n">it</span><span class="p">::</span><span class="n">TakeUntil</span><span class="p">,</span> <span class="n">state</span><span class="p">)</span>
<span class="n">i</span><span class="p">,</span> <span class="n">s</span> <span class="o">=</span> <span class="n">next</span><span class="p">(</span><span class="n">it</span><span class="o">.</span><span class="n">xs</span><span class="p">,</span> <span class="n">state</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
<span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">it</span><span class="o">.</span><span class="n">cond</span><span class="p">(</span><span class="n">i</span><span class="p">))</span>
<span class="k">end</span>
<span class="k">function</span><span class="nf"> Base</span><span class="o">.</span><span class="n">done</span><span class="p">(</span><span class="n">it</span><span class="p">::</span><span class="n">TakeUntil</span><span class="p">,</span> <span class="n">state</span><span class="p">)</span>
<span class="n">s</span><span class="p">,</span> <span class="n">iscond</span> <span class="o">=</span> <span class="n">state</span>
<span class="n">iscond</span> <span class="o">||</span> <span class="n">done</span><span class="p">(</span><span class="n">it</span><span class="o">.</span><span class="n">xs</span><span class="p">,</span> <span class="n">s</span><span class="p">)</span>
<span class="k">end</span>
</pre></div>
<div class="highlight"><pre><span class="n">collect</span><span class="p">(</span><span class="n">takeuntil</span><span class="p">(</span><span class="mi">1</span><span class="p">:</span><span class="mi">10</span><span class="p">,</span> <span class="n">x</span> <span class="o">-></span> <span class="n">x</span><span class="o">*</span><span class="n">x</span> <span class="o">>=</span> <span class="mi">25</span><span class="p">))</span> <span class="c"># x <= sqrt(25) -> 1:5</span>
</pre></div>
<div class="highlight"><pre>5-element Array{Any,1}:
1
2
3
4
5
</pre></div>
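<p>The <code>start</code>/<code>next</code>/<code>done</code> protocol used above belongs to the Julia of this post’s era; Julia 1.0 and later replaced it with a single <code>iterate</code> function. As a rough sketch only (not the code this post was run with), the same <code>TakeUntil</code> idea might look like this on a current Julia. The second type parameter <code>F</code>, the helper <code>step_until</code>, and the <code>IteratorSize</code>/<code>eltype</code> declarations are assumptions added here so that <code>collect</code> behaves sensibly.</p>
<div class="highlight"><pre># Sketch for Julia 1.x: take values until one passes the test (inclusive).
struct TakeUntil{I,F}
    xs::I
    cond::F
end

takeuntil(xs, cond) = TakeUntil(xs, cond)

# The length is not known up front, so collect must grow the array as it goes.
Base.IteratorSize(::Type{TakeUntil{I,F}}) where {I,F} = Base.SizeUnknown()
Base.eltype(::Type{TakeUntil{I,F}}) where {I,F} = eltype(I)

# Wrap one step of the inner iterator, remembering whether the test passed.
function step_until(it::TakeUntil, y)
    y === nothing ? nothing : (y[1], (y[2], it.cond(y[1])))
end

Base.iterate(it::TakeUntil) = step_until(it, iterate(it.xs))

function Base.iterate(it::TakeUntil, state)
    s, passed = state
    passed ? nothing : step_until(it, iterate(it.xs, s))
end

collect(takeuntil(1:10, x -> x * x >= 25))  # [1, 2, 3, 4, 5]
</pre></div>
<p>The state is the same pair as in the original: the inner iterator’s state plus a flag recording whether the last value passed the test.</p>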
<h3>Application: How long does it take a Poisson process to produce a prime number?</h3>
<p>As an application of the <code>TakeUntil</code> iterator, here’s an experiment: how many draws do
we have to make from a Poisson distribution until we draw a prime number? For this
example, I’ll use a Poisson with mean 5,000.</p>
<p>In the code, we make a <code>Repeat</code> iterator that repeatedly draws from the
Poisson. We pass this into <code>takeuntil</code> and this creates an iterator that draws
from the Poisson until we find a prime number. While this is happening, we keep track of the
number of steps we took through this iterator.</p>
<div class="highlight"><pre><span class="c"># Draw random integers from a distribution d until you get a prime number.</span>
<span class="c"># Return the number of draws.</span>
<span class="k">function</span><span class="nf"> primetime</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">dparams</span><span class="p">)</span>
<span class="n">randgen</span> <span class="o">=</span> <span class="p">()</span> <span class="o">-></span> <span class="n">rand</span><span class="p">(</span><span class="n">d</span><span class="p">(</span><span class="n">dparams</span><span class="o">...</span><span class="p">))</span>
<span class="n">tu</span> <span class="o">=</span> <span class="n">takeuntil</span><span class="p">(</span><span class="n">repeated</span><span class="p">(</span><span class="n">randgen</span><span class="p">),</span> <span class="n">isprime</span><span class="p">)</span>
<span class="n">time</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">for</span> <span class="n">i</span> <span class="o">=</span> <span class="n">tu</span>
<span class="n">time</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">end</span>
<span class="n">time</span>
<span class="k">end</span>
<span class="n">primetime_poiss5k</span> <span class="o">=</span> <span class="p">()</span> <span class="o">-></span> <span class="n">primetime</span><span class="p">(</span><span class="n">Poisson</span><span class="p">,</span> <span class="mi">5000</span><span class="p">)</span>
</pre></div>
<p>What’s the average wait for a prime? Repeating the experiment 10,000 times, we
find the average number of draws is between 7 and 8.</p>
<div class="highlight"><pre><span class="n">mean</span><span class="p">(</span><span class="n">repeated</span><span class="p">(</span><span class="n">primetime_poiss5k</span><span class="p">,</span> <span class="mi">10_000</span><span class="p">))</span>
</pre></div>
<div class="highlight"><pre>7.6783
</pre></div>
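<p>As a rough sanity check on that figure: each draw is an independent trial, so the number of draws until the first prime is geometrically distributed, and its mean is one over the probability that a single draw is prime. The sketch below (Julia 1.x) approximates that probability by the share of primes in a window around 5,000. The window <code>4800:5200</code> and the helper <code>isprime_trial</code> are illustrative assumptions, and treating draws as uniform over the window is cruder than the actual Poisson weighting, so this gives only a ballpark figure.</p>
<div class="highlight"><pre># Trial-division primality test; adequate for integers this small.
function isprime_trial(n::Integer)
    n >= 2 || return false
    for d in 2:isqrt(n)
        if n % d == 0
            return false
        end
    end
    return true
end

# Assumed window of roughly +/- 3 standard deviations around the mean of 5,000
# (a Poisson with mean 5,000 has standard deviation sqrt(5000), about 71).
window = 4800:5200
p = count(isprime_trial, window) / length(window)  # share of primes in the window
expected_wait = 1 / p                              # mean of a geometric waiting time
println(expected_wait)
</pre></div>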
<p>To see the distribution of waiting times, I’ll collect each repetition of the
experiment in an array that we can plot.</p>
<div class="highlight"><pre><span class="n">times</span> <span class="o">=</span> <span class="p">[</span><span class="n">t</span><span class="p">::</span><span class="kt">Int</span> <span class="k">for</span> <span class="n">t</span> <span class="o">=</span> <span class="n">repeated</span><span class="p">(</span><span class="n">primetime_poiss5k</span><span class="p">,</span> <span class="mi">10_000</span><span class="p">)]</span>
</pre></div>
<div id="primetimechart"></div>
<script src="scripts/primetime.js"></script>
<script>
draw("#primetimechart");
</script>
<h2>Partition</h2>
<p>The <code>Partition</code> iterator splits its input iterator into pieces, producing an
iterator over iterators. For example, we could use it to partition the Range
iterator <code>1:100</code> into two iterators, <code>1:50</code> and <code>51:100</code>. We can also make
overlapping partitions, for example, <code>1:50</code>, <code>2:51</code>, <code>3:52</code>, etc.</p>
<h3>Application: Moving average</h3>
<p>One useful application of overlapping partitions is computing moving
averages. The following code imports Google’s historical stock price from Yahoo
Finance and computes its 60-day moving average. </p>
<p>First, we download the data and parse it into an array of rows, each holding a date, prices, and volume.</p>
<div class="highlight"><pre><span class="kd">const</span> <span class="n">googdata</span> <span class="o">=</span>
<span class="s">"http://ichart.finance.yahoo.com/table.csv?s=GOOG&d=0&e=7&f=2014&g=d&a=0&b=7&c=2013&ignore=.csv"</span> <span class="o">|></span>
<span class="n">download</span> <span class="o">|></span>
<span class="n">open</span> <span class="o">|></span>
<span class="n">readall</span> <span class="o">|></span>
<span class="n">s</span> <span class="o">-></span> <span class="n">split</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span> <span class="o">|></span>
<span class="n">a</span> <span class="o">-></span> <span class="n">map</span><span class="p">(</span><span class="n">l</span> <span class="o">-></span> <span class="n">split</span><span class="p">(</span><span class="n">l</span><span class="p">,</span> <span class="s">","</span><span class="p">),</span> <span class="n">a</span><span class="p">)</span> <span class="o">|></span>
<span class="n">a</span> <span class="o">-></span> <span class="n">filter</span><span class="p">(</span><span class="n">l</span> <span class="o">-></span> <span class="n">contains</span><span class="p">(</span><span class="n">l</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="s">"201"</span><span class="p">),</span> <span class="n">a</span><span class="p">)</span> <span class="o">|></span>
<span class="n">reverse</span> <span class="c"># Dates start at most recent, so reverse for chron order.</span>
</pre></div>
<p>We then create iterators over the dates and closing prices in the Array. These
iteratively extract and parse values from the relevant columns.</p>
<div class="highlight"><pre><span class="n">dates</span> <span class="o">=</span> <span class="n">imap</span><span class="p">(</span><span class="n">r</span> <span class="o">-></span> <span class="n">date</span><span class="p">(</span><span class="n">r</span><span class="p">[</span><span class="mi">1</span><span class="p">]),</span> <span class="n">googdata</span><span class="p">)</span>
<span class="n">close</span> <span class="o">=</span> <span class="n">imap</span><span class="p">(</span><span class="n">r</span> <span class="o">-></span> <span class="n">parsefloat</span><span class="p">(</span><span class="n">r</span><span class="p">[</span><span class="mi">7</span><span class="p">]),</span> <span class="n">googdata</span><span class="p">)</span>
</pre></div>
<p>Now we can make 60-day sub-period partitions and compute the average of
each. Since I’m using <code>imap</code>, nothing has been calculated yet. These are all just
iterators promising to do work when called.</p>
<div class="highlight"><pre><span class="n">ma60</span> <span class="o">=</span> <span class="n">imap</span><span class="p">(</span><span class="n">mean</span><span class="p">,</span> <span class="n">partition</span><span class="p">(</span><span class="n">close</span><span class="p">,</span> <span class="mi">60</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
<span class="c"># NB: The Julian way to do this would be</span>
<span class="c"># [mean(price[i-59:i]) for i = 60:length(price)]</span>
</pre></div>
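<p>For comparison, the array-comprehension approach mentioned in the comment above can itself be made lazy with a generator. This is a minimal sketch for Julia 1.x, not the <code>partition</code>-based code used in this post; <code>moving_average</code> and the toy <code>prices</code> vector are made-up names for illustration.</p>
<div class="highlight"><pre>using Statistics  # standard library; provides mean

# Lazy moving average over a vector: nothing is computed until it is iterated.
moving_average(xs::AbstractVector, n::Integer) =
    (mean(view(xs, i-n+1:i)) for i in n:length(xs))

prices = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up data, for illustration only
collect(moving_average(prices, 3))   # [2.0, 3.0, 4.0]
</pre></div>
<p>Each window is a <code>view</code>, so no sub-arrays are copied; the trade-off is that the input has to be an indexable array rather than an arbitrary iterator.</p>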
<p>With all these useful iterators defined, I can just collect them into arrays for plotting.</p>
<div class="highlight"><pre><span class="n">plot</span><span class="p">(</span><span class="n">layer</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">collect</span><span class="p">(</span><span class="n">dates</span><span class="p">),</span> <span class="n">y</span><span class="o">=</span><span class="n">collect</span><span class="p">(</span><span class="n">close</span><span class="p">),</span> <span class="n">Geom</span><span class="o">.</span><span class="n">line</span><span class="p">),</span>
<span class="n">layer</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">collect</span><span class="p">(</span><span class="n">dates</span><span class="p">)[</span><span class="mi">60</span><span class="p">:</span><span class="k">end</span><span class="p">],</span> <span class="n">y</span> <span class="o">=</span> <span class="n">collect</span><span class="p">(</span><span class="n">ma60</span><span class="p">),</span> <span class="n">Geom</span><span class="o">.</span><span class="n">line</span><span class="p">),</span>
<span class="n">Guide</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">"Price"</span><span class="p">),</span>
<span class="n">Guide</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s">"GOOG Daily Stock Price 60-Day Moving Avg."</span><span class="p">))</span>
</pre></div>
<div id="ma60chart"></div>
<script src="scripts/ma60.js"></script>
<script>
draw("#ma60chart");
</script>
<h2>Groupby</h2>
<p>While the <code>Partition</code> iterator makes partitions of specified lengths, the
<code>Groupby</code> iterator splits an iterator based on some condition. One caveat is
that the input iterator has to be sorted in some way on the groupby condition,
so that values with the same condition are adjacent in the iterator. </p>
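<p>To make the sorting caveat concrete, here is a small eager stand-in for the grouping logic, written for Julia 1.x. It is only a sketch of the run-grouping behaviour described above, not the <code>Groupby</code> iterator itself, and <code>group_runs</code> is a hypothetical helper name.</p>
<div class="highlight"><pre># Group adjacent elements that share the same key (eager, for illustration).
function group_runs(key, xs)
    groups = Vector{Vector{eltype(xs)}}()
    for x in xs
        if isempty(groups) || key(last(groups)[end]) != key(x)
            push!(groups, [x])      # key changed: start a new group
        else
            push!(last(groups), x)  # same key as the previous element
        end
    end
    return groups
end

unsorted = ["10", "12", "20", "11"]
group_runs(first, unsorted)        # 3 groups: "11" is cut off from "10" and "12"
group_runs(first, sort(unsorted))  # 2 groups: ["10", "11", "12"] and ["20"]
</pre></div>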
<h3>Application: Do Labor Force figures follow Benford’s Law?</h3>
<p>In this example, I’m going to look at <a href="http://en.wikipedia.org/wiki/Benford%27s_law">Benford’s Law</a> using the <code>Groupby</code>
iterator. Benford’s Law posits that the leading digits of numbers in many
data sources follow a predictable distribution, with small digits appearing more often than large ones. I’ll use the <code>Groupby</code> iterator to
group the data by first digit and check this.</p>
<p>The data I’ll examine is the size
of the labor force population in each <span class="caps">U.S.</span> county in 2012.</p>
<div class="highlight"><pre><span class="kd">const</span> <span class="n">lfdata</span> <span class="o">=</span> <span class="s">"http://www.bls.gov/lau/laucnty12.txt"</span> <span class="o">|></span>
<span class="n">download</span> <span class="o">|></span>
<span class="n">open</span> <span class="o">|></span>
<span class="n">readall</span> <span class="o">|></span>
<span class="n">s</span> <span class="o">-></span> <span class="n">split</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="s">"</span><span class="se">\r\n</span><span class="s">"</span><span class="p">)</span> <span class="o">|></span>
<span class="n">a</span> <span class="o">-></span> <span class="n">filter</span><span class="p">(</span><span class="n">x</span> <span class="o">-></span> <span class="n">length</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">==</span> <span class="mi">125</span><span class="p">,</span> <span class="n">a</span><span class="p">)</span> <span class="o">|></span> <span class="c"># Rows with data</span>
<span class="n">a</span> <span class="o">-></span> <span class="n">map</span><span class="p">(</span><span class="n">x</span> <span class="o">-></span> <span class="n">strip</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="mi">79</span><span class="p">:</span><span class="mi">92</span><span class="p">]),</span> <span class="n">a</span><span class="p">)</span> <span class="o">|></span> <span class="c"># Column w/ LF data</span>
<span class="n">a</span> <span class="o">-></span> <span class="n">map</span><span class="p">(</span><span class="n">x</span> <span class="o">-></span> <span class="n">replace</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="s">","</span><span class="p">,</span> <span class="s">""</span><span class="p">),</span> <span class="n">a</span><span class="p">)</span> <span class="o">|></span> <span class="c"># 1,000 -> 1000</span>
<span class="n">x</span> <span class="o">-></span> <span class="n">x</span><span class="p">[</span><span class="mi">2</span><span class="p">:</span><span class="k">end</span><span class="p">]</span> <span class="o">|></span> <span class="c"># Remove header</span>
<span class="n">sort</span>
</pre></div>
<p>The analysis is simple with a <code>Groupby</code> iterator. It splits up the data by
leading digit, and then I just calculate the frequency of each leading digit in
the data by taking the length of each leading-digit group as a share of
the total length of the data.</p>
<div class="highlight"><pre><span class="n">dgroups</span> <span class="o">=</span> <span class="n">groupby</span><span class="p">(</span><span class="n">lfdata</span><span class="p">,</span> <span class="n">s</span> <span class="o">-></span> <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span> <span class="c"># Groups by first digit</span>
<span class="c"># Extract the digit from the group members</span>
<span class="n">digits</span> <span class="o">=</span> <span class="n">imap</span><span class="p">(</span><span class="n">i</span> <span class="o">-></span> <span class="n">parseint</span><span class="p">(</span><span class="n">string</span><span class="p">(</span><span class="n">i</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="mi">1</span><span class="p">])),</span> <span class="n">dgroups</span><span class="p">)</span>
<span class="c"># Compute the frequency</span>
<span class="n">frequency</span> <span class="o">=</span> <span class="n">imap</span><span class="p">(</span><span class="n">x</span> <span class="o">-></span> <span class="n">length</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">/</span> <span class="n">length</span><span class="p">(</span><span class="n">lfdata</span><span class="p">),</span> <span class="n">dgroups</span><span class="p">)</span>
</pre></div>
<p>Benford’s Law posits that the frequency of leading digit <em>d</em> in data should be
log<sub>10</sub>(<em>d+1</em>) - log<sub>10</sub>(<em>d</em>). This function prints out a
table of the observed frequencies next to the expected frequencies per Benford’s Law.</p>
<div class="highlight"><pre><span class="n">benfordcheck</span> <span class="o">=</span> <span class="n">function</span><span class="p">(</span><span class="n">obs_freqs</span><span class="p">,</span> <span class="n">digits</span><span class="p">)</span>
<span class="n">pred_freqs</span> <span class="o">=</span> <span class="n">map</span><span class="p">(</span><span class="n">d</span> <span class="o">-></span> <span class="n">log10</span><span class="p">(</span><span class="n">d</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span> <span class="o">-</span> <span class="n">log10</span><span class="p">(</span><span class="n">d</span><span class="p">),</span> <span class="n">digits</span><span class="p">)</span>
<span class="n">println</span><span class="p">(</span><span class="s">""</span><span class="p">)</span>
<span class="n">println</span><span class="p">(</span><span class="s">"Digit Frequency Compared to Benford's Law"</span><span class="p">)</span>
<span class="n">println</span><span class="p">(</span><span class="s">"========================================="</span><span class="p">)</span>
<span class="n">println</span><span class="p">(</span><span class="s">""</span><span class="p">)</span>
<span class="n">println</span><span class="p">(</span><span class="s">"Digit Observed Expected Difference"</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">o</span><span class="p">,</span> <span class="n">e</span><span class="p">)</span> <span class="k">in</span> <span class="n">zip</span><span class="p">(</span><span class="n">digits</span><span class="p">,</span> <span class="n">obs_freqs</span><span class="p">,</span> <span class="n">pred_freqs</span><span class="p">)</span>
<span class="p">@</span><span class="n">printf</span><span class="p">(</span> <span class="s">"%5d %9.2f %9.2f %11.2f</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="mi">100</span><span class="o">*</span><span class="n">o</span><span class="p">,</span> <span class="mi">100</span><span class="o">*</span><span class="n">e</span><span class="p">,</span> <span class="mi">100</span><span class="o">*</span><span class="p">(</span><span class="n">o</span><span class="o">-</span><span class="n">e</span><span class="p">))</span>
<span class="k">end</span>
<span class="k">end</span>
</pre></div>
<p>We can see the labor force data follows Benford’s Law quite closely.</p>
<div class="highlight"><pre><span class="n">benfordcheck</span><span class="p">(</span><span class="n">frequency</span><span class="p">,</span> <span class="n">digits</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre>Digit Frequency Compared to Benford's Law
=========================================
Digit Observed Expected Difference
1 30.09 30.10 -0.01
2 16.46 17.61 -1.15
3 12.02 12.49 -0.48
4 9.72 9.69 0.03
5 8.29 7.92 0.37
6 6.30 6.69 -0.39
7 6.02 5.80 0.23
8 5.84 5.12 0.72
9 5.25 4.58 0.67
</pre></div>
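<p>The same comparison can be sketched without the iterator machinery. The block below, written for Julia 1.x, tallies leading digits directly and compares them with the Benford frequencies. The powers-of-two sample is a stand-in chosen because it is known to follow Benford’s Law closely; it is not the labor force data, and <code>leading_digit_freqs</code> and <code>benford</code> are hypothetical helper names.</p>
<div class="highlight"><pre># Share of values starting with each digit 1..9 (positive integers assumed).
function leading_digit_freqs(xs)
    counts = zeros(Int, 9)
    for x in xs
        d = parse(Int, first(string(x)))  # first character of the decimal form
        counts[d] += 1
    end
    return counts ./ length(xs)
end

# Expected share of leading digit d under Benford's Law.
benford(d) = log10(d + 1) - log10(d)

sample = [2^k for k in 0:59]  # stand-in data; 2^59 still fits in an Int64
obs = leading_digit_freqs(sample)
for d in 1:9
    println(d, "  ", round(100 * obs[d], digits=2), "  ", round(100 * benford(d), digits=2))
end
</pre></div>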
<p>To plot the comparison, I can collect the values from our iterators into a DataFrame.</p>
<div class="highlight"><pre><span class="n">benford_df</span> <span class="o">=</span> <span class="n">DataFrame</span><span class="p">(</span><span class="c"># Extract the digit</span>
<span class="n">digits</span> <span class="o">=</span> <span class="n">collect</span><span class="p">(</span><span class="n">digits</span><span class="p">),</span>
<span class="n">observed</span> <span class="o">=</span> <span class="n">collect</span><span class="p">(</span><span class="n">frequency</span><span class="p">),</span>
<span class="n">expected</span> <span class="o">=</span> <span class="n">map</span><span class="p">(</span><span class="n">d</span> <span class="o">-></span> <span class="n">log10</span><span class="p">(</span><span class="n">d</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span> <span class="o">-</span> <span class="n">log10</span><span class="p">(</span><span class="n">d</span><span class="p">),</span> <span class="n">digits</span><span class="p">))</span>
</pre></div>
<div id="barchart"></div>
<script src="./scripts/bl.js"></script>
<script>
draw("#barchart");
</script>
<h2>Iterate</h2>
<p>Though its name might be a little confusing, the <code>Iterate</code> iterator is one of my
favorites. It recursively applies a function to a starting value, that is
<code>f(...f(f(f(x)))...)</code>. I come across applications for it all over the place.</p>
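<p>As a sketch of the idea in Julia 1.x terms (where <code>iterate</code> is now the name of the core protocol function, so a different name is needed), a minimal version of such an iterator could look like the following. The type name <code>Iterated</code> is an assumption for this example, not the definition used in this post.</p>
<div class="highlight"><pre># Sketch for Julia 1.x: yields x0, f(x0), f(f(x0)), ... without end.
struct Iterated{F,T}
    f::F
    x0::T
end

# The sequence never terminates, so say so; callers must cut it off themselves.
Base.IteratorSize(::Type{Iterated{F,T}}) where {F,T} = Base.IsInfinite()

# The state is simply the most recent value.
Base.iterate(it::Iterated) = (it.x0, it.x0)
Base.iterate(it::Iterated, x) = (y = it.f(x); (y, y))

collect(Iterators.take(Iterated(x -> 2x, 1), 5))  # 1, 2, 4, 8, 16
</pre></div>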
<h3>Application: Autoregressive time series processes</h3>
<p>One application is producing autoregressive time series processes. An <span class="caps">AR</span>(1)
process has the form <em>x<sub>t+1</sub> = px<sub>t</sub> + e<sub>t+1</sub></em>, where
<em>e</em> is some random noise. If
we define the function <em>f(x) = px + e</em>, drawing a fresh <em>e</em> on each call, then <em>x<sub>t+2</sub></em> as a function
of <em>x<sub>t</sub></em> is <em>f(f(x<sub>t</sub>))</em>. Subsequent values can be similarly
produced by iteratively applying the function.</p>
<p>First the code for the <span class="caps">AR</span>(1) function itself, along with a helper function for
plotting a realization of the process.</p>
<div class="highlight"><pre><span class="k">function</span><span class="nf"> ar</span><span class="p">(</span><span class="n">phi</span><span class="p">::</span><span class="kt">Float64</span><span class="p">,</span> <span class="n">sigma</span><span class="p">::</span><span class="kt">Float64</span><span class="p">)</span>
<span class="n">x</span> <span class="o">-></span> <span class="n">phi</span> <span class="o">*</span> <span class="n">x</span> <span class="o">+</span> <span class="n">sigma</span> <span class="o">*</span> <span class="n">randn</span><span class="p">()</span>
<span class="k">end</span>
<span class="n">plotar</span><span class="p">(</span><span class="n">arseq</span><span class="p">,</span> <span class="n">title</span><span class="p">)</span> <span class="o">=</span> <span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="mi">1</span><span class="p">:</span><span class="n">length</span><span class="p">(</span><span class="n">arseq</span><span class="p">),</span> <span class="n">y</span><span class="o">=</span><span class="n">arseq</span><span class="p">,</span> <span class="n">Geom</span><span class="o">.</span><span class="n">line</span><span class="p">,</span>
<span class="n">Guide</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">"Time"</span><span class="p">),</span> <span class="n">Guide</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">""</span><span class="p">),</span>
<span class="n">Guide</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="n">title</span><span class="p">))</span>
</pre></div>
<p>Defining a coefficient and a standard deviation for the random variable, I pass
them through a process that creates an iterator that recursively applies the
function, starting with a randomly-drawn initial value. Then I collect 250 values of
that iterator and plot them.</p>
<div class="highlight"><pre><span class="kd">const</span> <span class="n">ar1coef</span> <span class="o">=</span> <span class="mf">0.9</span>
<span class="kd">const</span> <span class="n">ar1sigma</span> <span class="o">=</span> <span class="mf">0.15</span>
<span class="p">(</span><span class="n">ar1coef</span><span class="p">,</span> <span class="n">ar1sigma</span><span class="p">)</span> <span class="o">|></span>
<span class="n">x</span> <span class="o">-></span> <span class="n">apply</span><span class="p">(</span><span class="n">ar</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="o">|></span>
<span class="n">f</span> <span class="o">-></span> <span class="n">iterate</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">ar1sigma</span><span class="o">*</span><span class="n">rand</span><span class="p">())</span> <span class="o">|></span>
<span class="n">i</span> <span class="o">-></span> <span class="n">collect</span><span class="p">(</span><span class="n">take</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="mi">250</span><span class="p">))</span> <span class="o">|></span>
<span class="n">s</span> <span class="o">-></span> <span class="n">plotar</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="s">"AR(1) Time Series"</span><span class="p">)</span>
</pre></div>
<div id="ar1chart"></div>
<script src="scripts/ar1.js"></script>
<script>
draw("#ar1chart");
</script>
<p>This idea can be extended to an <span class="caps">AR</span>(p) process, where the current value of <em>x</em>
depends on several past values. Whereas the coefficient was a scalar in the
<span class="caps">AR</span>(1) model, it’s a matrix now, but the formula is otherwise the same.</p>
<div class="highlight"><pre><span class="k">function</span><span class="nf"> ar</span><span class="p">(</span><span class="n">coeffs</span><span class="p">::</span><span class="n">AbstractVector</span><span class="p">{</span><span class="kt">Float64</span><span class="p">},</span> <span class="n">sigma</span><span class="p">::</span><span class="kt">Float64</span><span class="p">)</span>
<span class="n">p</span> <span class="o">=</span> <span class="n">length</span><span class="p">(</span><span class="n">coeffs</span><span class="p">)</span>
<span class="n">Phi</span> <span class="o">=</span> <span class="p">[</span><span class="n">coeffs</span><span class="o">'</span><span class="p">,</span> <span class="n">eye</span><span class="p">(</span><span class="n">p</span><span class="p">)[</span><span class="mi">1</span><span class="p">:(</span><span class="k">end</span><span class="o">-</span><span class="mi">1</span><span class="p">),:]]</span>
<span class="n">Sigma</span> <span class="o">=</span> <span class="p">[</span><span class="n">sigma</span><span class="p">,</span> <span class="n">zeros</span><span class="p">(</span><span class="n">p</span><span class="o">-</span><span class="mi">1</span><span class="p">)]</span>
<span class="n">x</span> <span class="o">-></span> <span class="n">Phi</span> <span class="o">*</span> <span class="n">x</span> <span class="o">+</span> <span class="n">Sigma</span> <span class="o">.*</span> <span class="n">randn</span><span class="p">(</span><span class="n">p</span><span class="p">)</span>
<span class="k">end</span>
</pre></div>
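<p>On current Julia versions, <code>eye</code> is no longer available and <code>[a, b]</code> no longer concatenates, so the function above will not run unchanged. A hedged sketch of an equivalent for Julia 1.x follows; <code>ar_step</code> and <code>simulate</code> are illustrative names, not part of this post’s code.</p>
<div class="highlight"><pre>using LinearAlgebra  # standard library; provides the identity object I

# Build the one-step update for an AR(p) process in companion-matrix form.
function ar_step(coeffs::AbstractVector{Float64}, sigma::Float64)
    p = length(coeffs)
    # First row holds the coefficients; the rows below shift
    # earlier values down by one slot.
    Phi = vcat(coeffs', Matrix{Float64}(I, p - 1, p))
    Sigma = vcat(sigma, zeros(p - 1))  # noise enters the first slot only
    return x -> Phi * x + Sigma .* randn(p)
end

# Apply f repeatedly from x0, recording the first component at each step.
function simulate(f, x0, n)
    xs = Vector{Float64}(undef, n)
    x = x0
    for t in 1:n
        xs[t] = first(x)
        x = f(x)
    end
    return xs
end

series = simulate(ar_step([0.9, -0.1, -0.25], 0.15), 0.15 .* rand(3), 250)
</pre></div>
<p>Setting <code>sigma</code> to <code>0.0</code> makes the update deterministic, which is a convenient way to check the matrix by hand.</p>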
<p>For an example, here are 250 periods simulated from an <span class="caps">AR</span>(3) model.</p>
<div class="highlight"><pre><span class="kd">const</span> <span class="n">ar3coeffs</span> <span class="o">=</span> <span class="p">[</span><span class="o">.</span><span class="mi">9</span><span class="p">,</span> <span class="o">-.</span><span class="mi">1</span><span class="p">,</span> <span class="o">-.</span><span class="mi">25</span><span class="p">]</span>
<span class="kd">const</span> <span class="n">ar3sigma</span> <span class="o">=</span> <span class="mf">0.15</span>
<span class="p">(</span><span class="n">ar3coeffs</span><span class="p">,</span> <span class="n">ar3sigma</span><span class="p">)</span> <span class="o">|></span>
<span class="n">x</span> <span class="o">-></span> <span class="n">apply</span><span class="p">(</span><span class="n">ar</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="o">|></span>
<span class="n">f</span> <span class="o">-></span> <span class="n">iterate</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">ar3sigma</span><span class="o">*</span><span class="n">rand</span><span class="p">(</span><span class="mi">3</span><span class="p">))</span> <span class="o">|></span>
<span class="n">i</span> <span class="o">-></span> <span class="n">map</span><span class="p">(</span><span class="n">first</span><span class="p">,</span> <span class="n">take</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="mi">250</span><span class="p">))</span> <span class="o">|></span>
<span class="n">s</span> <span class="o">-></span> <span class="n">plotar</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="s">"AR(3) Time Series"</span><span class="p">)</span>
</pre></div>
<div id="ar3chart"></div>
<script src="scripts/ar3.js"></script>
<script>
draw("#ar3chart");
</script>
<h2>Conclusion</h2>
<p>Most iteration you’ll see in the wild uses simple collections or ranges as the
iterator, performing extensive work inside the loop. Sometimes our problem can be
better expressed using more complicated iterators whose structure represents the
logic of our iteration. One thing to notice in all the examples was that once
the iterators were defined, there was very little to do after iterating over
them. Typically I was just collecting the iteration values into an array, or
reducing them with an operation to a scalar result. We were also able to build
the problems in such a way that calculation of values in the iterators was
delayed until absolutely necessary.</p>
<p>There are tradeoffs to this sort of style, and much of the stuff in this
post was more cute than practical. But it was a fun exploration of how to create types
that interact with protocols in Julia. Julia’s type system and dispatch design
are very powerful and interesting, and give programmers a lot of flexibility in
expressing their problems.</p>
<hr />
<ol class="footnotes">
<li><a name="fn-multdisp"></a>
While
we’ll see lots of examples of extending Julia’s base functions dispatched on
newly-defined types, we won’t see much <i>multiple dispatch</i>, which is an
important design feature of Julia. In fact, pretty much everything here could be
implemented in a single-dispatch <span class="caps">OO</span> language.
<a href="#fnm-multdisp"><i class="fa fa-level-up"></i></a></li>
<li><a name="fn-iterable"></a>
Pythonistas
may be thinking about the distinction between <i>iterators</i> and <i>iterables</i>. (See,
e.g. <a href="http://stackoverflow.com/questions/9884132/understanding-pythons-iterator-iterable-and-iteration-protocols-what-exact">this Stack Overflow thread</a>.) That distinction doesn’t really apply to
Julia. So I won’t use the term iterable here, and I’ll define an iterator in the two
ways discussed above: (1) it is valid in a <code>for i = I</code> statement, and (2) it
implements the iterator protocol.
<a href="#fnm-iterable"><i class="fa fa-level-up"></i></a></li>
<li><a name="fn-arrcomp"></a>
This limitation seems
to come from the idea that only iterators with known lengths can be counted on
to produce multidimensional arrays. This may be changed in future versions of
Julia. See, e.g. <a href="https://github.com/JuliaLang/julia/issues/550">Issue #550</a>. The <code>collect</code> function uses the
<code>push!</code> function to dynamically allocate the array, but <code>collect</code> can only give
a 1-D Array output, whereas comprehensions can be multidimensional.
<a href="#fnm-arrcomp"><i class="fa fa-level-up">
</i></a></li>
<li><a name="fn-filter"></a>
Actually, Julia’s <code>filter</code> function already does this. If you pass
that function a predicate or condition function and an iterator, it produces a
<code>Filter</code> object that you can then iterate over. This is different from
<code>map</code> which can take an input iterator, but returns the result of
mapping the function immediately.
<a href="#fnm-filter"><i class="fa fa-level-up">
</i></a></li>
</ol>Pardon the dust2013-09-30T00:00:00-04:00Carltag:slendermeans.org,2013-09-30:pardon-the-dust.html<p><strong> Update 9/10/2013 </strong> New posts are going up on the blog, but I’m going to keep this post at the top for a while. Consider the site in beta for the moment, and please use the comment section of this post to report any issues. If you’re using <span class="caps">IE</span> to try and view the site, I’m sorry. But I’m not that sorry.</p>
<p><br></p>
<p><strong> Update 9/3/2013 </strong> Things should be working reasonably well. A few kinks to work out, and I have to migrate the former site’s comments, but the current site is pretty much ready to go.</p>
<p><br></p>
<p>This is the new home for my blog, <em>Slender Means</em>. It’s currently in-progress, and I’m still finishing up the design, and fixing weird links and typos from the Wordpress to Pelican migration.</p>
<p>In the meantime, a more usable version sits at the old home: <a href="http://slendrmeans.wordpress.com">http://slendrmeans.wordpress.com</a>.</p>
<p>Thanks for visiting!</p>
<p>-c.</p>Machine Learning for Hackers Chapter 8: Principal Components Analysis2013-09-06T17:30:00-04:00Carltag:slendermeans.org,2013-09-06:ml4h-ch8.html<p>The <a href="http://nbviewer.ipython.org/urls/raw.github.com/carljv/Will_it_Python/master/MLFH/ch8/ch8.ipynb">code for Chapter 8</a> has been sitting around for a long time now. Let’s blow the dust off and check it out. One thing before we start: explaining <span class="caps">PCA</span> well is kinda hard. If any experts reading feel like I’ve described something imprecisely (and have a better description), I’m very open to suggestions.</p>
<h2>Introduction</h2>
<p>Chapter 8 is about <em>Principal Components Analysis</em> (<span class="caps">PCA</span>), which the authors perform on data with time series of prices for 24 stocks. In very broad terms, <span class="caps">PCA</span> is about projecting many real-life, observed variables onto a smaller number of “abstract” variables, the principal components. Principal components are selected in order to best preserve the variation and correlation of the original variables. For example, if we have 100 variables in our data, which are all highly correlated, we can project them down to just a few principal components; i.e., the high correlation between them can be imagined as coming from an underlying factor that drives all of them, with some other less important factors driving their differences. When variables aren’t highly correlated, more principal components are needed to describe them well.</p>
<p>As you might imagine, <span class="caps">PCA</span> can be a very effective way of dealing with multi-collinearity that crops up in datasets with lots of variables. The downside is that <span class="caps">PCA</span> is just a mechanical process that is independent of the phenomenon we’re studying; the “principal components” we find don’t have to have any real-world meaning: they’re just mathematical constructs. Sometimes we can give meaningful interpretations to the principal components by analogizing them to real underlying factors that theoretically drive our data. But this can be tricky, from both a technical and epistemological standpoint.</p>
<p>For the stocks the authors analyze, they ultimately try and reduce their description to a single principal component, which they interpret as a kind of “market-wide” factor, and compare with a broad market index (here the <span class="caps">DJIA</span>). This is a not uncommon application of <span class="caps">PCA</span> in stock analysis. But they’ve got a technical problem here.</p>
<p>To perform <span class="caps">PCA</span>, your data have to have a meaningful covariance matrix (or correlation matrix, but the conditions are equivalent). They analyze stock <em>prices</em>, which are non-stationary time series variables. This means their covariance matrices change with time, so you can’t really estimate a meaningful covariance matrix from a sample of data. Your estimator implicitly assumes the data are stationary, so your estimated covariance matrix is meaningless. If we calculate the stock <em>returns</em> in the data, though, we can do <span class="caps">PCA</span> properly, and we’ll see the relationship of the resulting principal component with the broad market index is much cleaner.</p>
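<p>To make the prices-versus-returns point concrete, here is a minimal sketch with simulated data (not the book’s stocks; the seed, sample size and volatility are my own assumptions). It builds two <em>independent</em> return series, cumulates them into random-walk log prices, and compares the sample correlations.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_days = 5000

# Two independent series of daily log returns (stationary by construction).
returns_a = rng.normal(0.0, 0.01, n_days)
returns_b = rng.normal(0.0, 0.01, n_days)

# Cumulate them into log-price paths (random walks, hence non-stationary).
log_price_a = returns_a.cumsum()
log_price_b = returns_b.cumsum()

corr_returns = np.corrcoef(returns_a, returns_b)[0, 1]
corr_prices = np.corrcoef(log_price_a, log_price_b)[0, 1]

print('Correlation of returns: %6.3f' % corr_returns)
print('Correlation of prices:  %6.3f' % corr_prices)
```

<p>The returns correlation should sit close to zero, as it should for independent series. The price correlation can land almost anywhere between -1 and 1 depending on the seed, which is exactly why it isn’t a meaningful estimate.</p>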
<p>If you’re comfortable with <span class="caps">PCA</span> already, you don’t really have to worry about the conceptual content of this chapter. If you’re not, my advice is to take this chapter as a decent toy example of where and why one uses <span class="caps">PCA</span>, but don’t apply what’s done here elsewhere without learning more first. I’m not going to explain <span class="caps">PCA</span> in any detail; I just want to show where <span class="caps">PCA</span> functions live in the Python ecosystem and how they work. But, like most machine learning techniques, it shouldn’t be used at home without adult supervision.</p>
<p>As usual the IPython notebook lives at the Github repo <a href="https://github.com/carljv/Will_it_Python/blob/master/MLFH/ch8/ch8.ipynb">here</a>, and can be viewed via nbviewer <a href="http://nbviewer.ipython.org/urls/raw.github.com/carljv/Will_it_Python/master/MLFH/ch8/ch8.ipynb">here</a>.</p>
<h2>Stock data munging</h2>
<p>The raw data are in a long format, with (no. of stocks) × (no. of days) rows, and three columns (a date, a stock ticker and a price for that ticker on that day). This sort of dataset lends itself to a pandas DataFrame with a hierarchical index—and since there’s only one variable in the data (the price), we’ll <code>squeeze</code> the DataFrame to get a Series. The Dow Jones data, containing just one ticker, is more straightforward.</p>
<div class="highlight"><pre><span class="n">prices_long</span> <span class="o">=</span> <span class="n">read_csv</span><span class="p">(</span><span class="s">'data/stock_prices.csv'</span><span class="p">,</span> <span class="n">index_col</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span>
<span class="n">parse_dates</span> <span class="o">=</span> <span class="bp">True</span><span class="p">,</span> <span class="n">squeeze</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
<span class="n">dji_all</span> <span class="o">=</span> <span class="n">read_csv</span><span class="p">(</span><span class="s">'data/dji.csv'</span><span class="p">,</span> <span class="n">index_col</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="n">parse_dates</span> <span class="o">=</span> <span class="bp">True</span><span class="p">,</span>
<span class="n">squeeze</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
</pre></div>
<p>With the stock data indexed this way, it’s easy to create a <code>date</code> × <code>ticker</code> DataFrame with prices as entries: we just use <code>unstack</code>.</p>
<div class="highlight"><pre><span class="n">prices</span> <span class="o">=</span> <span class="n">prices_long</span><span class="o">.</span><span class="n">unstack</span><span class="p">()</span>
</pre></div>
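<p>For anyone who hasn’t used <code>unstack</code> on a hierarchical index before, here is a toy sketch of the long-to-wide reshape. The tickers, dates and prices are made up for illustration; only the shape of the operation matches what’s done above.</p>

```python
import pandas as pd

dates = pd.to_datetime(['2011-01-03', '2011-01-04', '2011-01-05'])
tickers = ['AAA', 'BBB']  # hypothetical tickers
index = pd.MultiIndex.from_product([dates, tickers], names=['date', 'ticker'])

# Long format: one price per (date, ticker) pair.
prices_long = pd.Series([10.0, 20.0, 10.5, 19.5, 11.0, 21.0],
                        index=index, name='price')

# Wide format: dates down the rows, one column per ticker.
prices_wide = prices_long.unstack()
print(prices_wide)
```

<p>By default <code>unstack</code> pivots the innermost index level (here the ticker) into columns, which is why the date ends up as the row index.</p>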
<p>Since we’ll ultimately want to perform this analysis with price returns, I’m going to create a similar dataset, just with returns instead of prices (note this will have one less day of data, since we don’t know the return for the first day in the data).</p>
<div class="highlight"><pre>calc_returns = lambda x: np.log(x / x.shift(1))[1:]
returns = prices.apply(calc_returns)
</pre></div>
<p>Note I’m using log returns here. Pandas DataFrames have a <code>pct_change</code> method that would provide another way of computing returns.</p>
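<p>As a quick sanity check on the two ways of computing returns, the sketch below uses a made-up four-day price series (an assumption, not the chapter’s data) to show how log returns relate to what <code>pct_change</code> gives you.</p>

```python
import numpy as np
import pandas as pd

prices = pd.Series([100.0, 102.0, 99.0, 103.0])  # made-up prices

# Log returns, dropping the first (undefined) observation.
log_returns = np.log(prices / prices.shift(1)).iloc[1:]

# Simple percentage returns from pandas.
simple_returns = prices.pct_change().iloc[1:]

# log(1 + simple return) recovers the log return.
same = np.allclose(np.log1p(simple_returns), log_returns)

# Log returns add up across days to the total log return.
total = np.log(prices.iloc[-1] / prices.iloc[0])
print(same, log_returns.sum(), total)
```

<p>For small daily moves the two are nearly identical; the additivity over time is the usual reason to prefer log returns.</p>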
<p>The authors’ <span class="caps">PCA</span> strategy here is to extract a “stock index” factor from the stock data by using the first principal component, the single principal component that captures the most variation in the underlying data.</p>
<p>The function <code>make_pca_index</code> is going to extract this first principal component, using the <code>PCA</code> class in scikit-learn’s <code>sklearn.decomposition</code> module. This is not the only way to get a <span class="caps">PCA</span> in Python; indeed, <span class="caps">PCA</span> is mechanically just an eigen-decomposition of the data’s correlation or covariance, so you could do this all from scratch in NumPy. The scikit-learn implementation, though, gives us a convenient <code>PCA</code> object to work with. And as usual with scikit-learn, the <a href="http://scikit-learn.org/stable/modules/decomposition.html#pca">documentation</a> is very good.</p>
<p>The <code>PCA</code> function works with a covariance or correlation matrix. We’re going to use a correlation matrix here; and our function will just take either stock price or return data, compute its correlation, then find the first principal component of the data. Notice the sign change I do at the end there: the component ended up being negatively related to the data (i.e. when this factor goes down, the data go up, etc.). <span class="caps">PCA</span> results are typically sign- and scale-invariant; hence the problems with interpretation. We’ll make our resulting index a “positive” one by just reversing the sign.</p>
<div class="highlight"><pre><span class="k">def</span> <span class="nf">make_pca_index</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">scale_data</span> <span class="o">=</span> <span class="bp">True</span><span class="p">):</span>
    <span class="sd">'''</span>
<span class="sd">    Compute the correlation matrix of a set of stock data, and return</span>
<span class="sd">    the first principal component.</span>
<span class="sd">    By default, the data are scaled to have mean zero and variance one</span>
<span class="sd">    before performing the PCA.</span>
<span class="sd">    '''</span>
    <span class="k">if</span> <span class="n">scale_data</span><span class="p">:</span>
        <span class="n">data_std</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">scale</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">data_std</span> <span class="o">=</span> <span class="n">data</span>
    <span class="n">corrs</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">asarray</span><span class="p">(</span><span class="n">data_std</span><span class="o">.</span><span class="n">cov</span><span class="p">())</span>
    <span class="n">pca</span> <span class="o">=</span> <span class="n">PCA</span><span class="p">(</span><span class="n">n_components</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">corrs</span><span class="p">)</span>
    <span class="c"># We end up getting a negative value for the index, so we'll reverse</span>
    <span class="c"># the sign to have it be more intuitive.</span>
    <span class="n">mkt_index</span> <span class="o">=</span> <span class="o">-</span><span class="n">scale</span><span class="p">(</span><span class="n">pca</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">data_std</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">mkt_index</span>
</pre></div>
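<p>For comparison, here is a rough sketch of the from-scratch NumPy route mentioned above: standardize, compute the correlation matrix, eigen-decompose it, and project onto the leading eigenvector. It runs on simulated one-factor data (my assumption) and is not a drop-in replacement for <code>make_pca_index</code>, which goes through scikit-learn.</p>

```python
import numpy as np

rng = np.random.default_rng(42)
n_obs, n_vars = 500, 4

# Simulated data: one common factor plus idiosyncratic noise.
factor = rng.normal(size=(n_obs, 1))
data = factor + 0.5 * rng.normal(size=(n_obs, n_vars))

# Standardize each column to mean zero and variance one.
data_std = (data - data.mean(axis=0)) / data.std(axis=0)

# Correlation matrix and its eigen-decomposition.
corr = np.corrcoef(data_std, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)  # eigh returns ascending eigenvalues

# Reorder so the largest eigenvalue (first principal component) comes first.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Scores on the first component; -first_pc is an equally valid answer.
first_pc = data_std.dot(eigvecs[:, 0])
print(eigvals)
```

<p>The sign of an eigenvector is arbitrary, so whether <code>first_pc</code> moves with or against the data depends on the solver. That is the same ambiguity the sign flip in <code>make_pca_index</code> deals with.</p>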
<h3>A <span class="caps">PCA</span> index with price data</h3>
<p>Let’s copy the authors and make an index from raw price data. Since prices don’t have meaningfully estimated correlations, this isn’t really correct, but it’s useful to compare with what’s in the book.</p>
<div class="highlight"><pre><span class="n">price_index</span> <span class="o">=</span> <span class="n">make_pca_index</span><span class="p">(</span><span class="n">prices</span><span class="p">)</span>
</pre></div>
<p>To see what’s going on, let’s make two plots: a scatter plot of our <span class="caps">PCA</span> index with the <span class="caps">DJIA</span>, and a time series plot with the two indices. These correspond to figures 8-4 and 8-5 in the book.</p>
<div class="highlight"><pre><span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span> <span class="o">=</span> <span class="p">(</span><span class="mi">17</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">121</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">price_index</span><span class="p">,</span> <span class="n">scale</span><span class="p">(</span><span class="n">dji</span><span class="p">),</span> <span class="s">'k.'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'PCA index'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Dow Jones Index'</span><span class="p">)</span>
<span class="n">ols_fit</span> <span class="o">=</span> <span class="n">OLSreg</span><span class="p">(</span><span class="n">scale</span><span class="p">(</span><span class="n">dji</span><span class="p">),</span> <span class="n">price_index</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">price_index</span><span class="p">,</span> <span class="n">ols_fit</span><span class="o">.</span><span class="n">fittedvalues</span><span class="p">,</span> <span class="s">'r-'</span><span class="p">,</span>
<span class="n">label</span> <span class="o">=</span> <span class="s">'R2 = </span><span class="si">%4.3f</span><span class="s">'</span> <span class="o">%</span> <span class="nb">round</span><span class="p">(</span><span class="n">ols_fit</span><span class="o">.</span><span class="n">rsquared</span><span class="p">,</span> <span class="mi">3</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">(</span><span class="n">loc</span> <span class="o">=</span> <span class="s">'upper left'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">122</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">dates</span><span class="p">,</span> <span class="n">price_index</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'PCA Price Index'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">dates</span><span class="p">,</span> <span class="n">scale</span><span class="p">(</span><span class="n">dji</span><span class="p">),</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'DJ Index'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">(</span><span class="n">loc</span> <span class="o">=</span> <span class="s">'upper left'</span><span class="p">)</span>
</pre></div>
<p><a class = "image" href="../images/pca_price.png">
<img src="../images/pca_price.png" width=400 />
</a></p>
<p>This actually seems to look okay, and wouldn’t really alert us to any problems if we didn’t know any better. Let’s repeat the exercise with returns, though.</p>
<h3>A <span class="caps">PCA</span> index with returns data</h3>
<p>Since returns are stationary, we can estimate a meaningful correlation matrix, and our <span class="caps">PCA</span> will make more sense.</p>
<div class="highlight"><pre><span class="n">returns_index</span> <span class="o">=</span> <span class="n">make_pca_index</span><span class="p">(</span><span class="n">returns</span><span class="p">)</span>
</pre></div>
<p>And the plots:</p>
<p><a class = "image" href="../images/pca_returns.png">
<img src="../images/pca_returns.png" width=400 />
</a></p>
<p>Looking at these, we see a much more straightforward linear relationship between the returns to the <span class="caps">DJIA</span> and the <span class="caps">PCA</span> index derived from stock returns.</p>
<h3>Explained variance</h3>
<p>Since the principal components are just eigenvectors of the correlation matrix, there will be as many of them as there are columns in our data (here 24). As we add components we explain more and more of the original correlation matrix. Once we add all 24, the amount of variation/correlation explained is 100%: all we’ve done is define a rotation of the matrix, so there’s no information lost. But a plot of the cumulative explained variance as we add principal components can help us to see how far and how reliably the data can be reduced.</p>
<div class="highlight"><pre><span class="n">plt</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">arange</span><span class="p">(</span><span class="mi">24</span><span class="p">)</span> <span class="o">+</span> <span class="o">.</span><span class="mi">5</span><span class="p">,</span> <span class="n">pca_fit</span><span class="o">.</span><span class="n">explained_variance_ratio_</span><span class="o">.</span><span class="n">cumsum</span><span class="p">())</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylim</span><span class="p">((</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'No. of principal components'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Cumulative variance explained'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">grid</span><span class="p">(</span><span class="n">axis</span> <span class="o">=</span> <span class="s">'y'</span><span class="p">,</span> <span class="n">ls</span> <span class="o">=</span> <span class="s">'-'</span><span class="p">,</span> <span class="n">lw</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span> <span class="n">color</span> <span class="o">=</span> <span class="s">'white'</span><span class="p">)</span>
</pre></div>
<p><a class="image" href="../images/pca_variance.png">
<img src="../images/pca_variance.png" width=400 />
</a></p>
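<p>The cumulative explained variance can also be read straight off the eigenvalues of the correlation matrix. The sketch below does that on simulated data with six columns (an assumption for illustration; it is not the 24-stock dataset or the fitted object used in the plot above).</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs, n_vars = 1000, 6

# One common factor plus noise, so the first component dominates.
factor = rng.normal(size=(n_obs, 1))
data = factor + rng.normal(size=(n_obs, n_vars))

corr = np.corrcoef(data, rowvar=False)
eigvals = np.linalg.eigvalsh(corr)[::-1]  # largest eigenvalue first

# Each eigenvalue's share of the total is that component's explained variance.
explained_ratio = eigvals / eigvals.sum()
cumulative = explained_ratio.cumsum()
print(np.round(cumulative, 3))
```

<p>The last entry is always 1: with all the components included, nothing has been thrown away.</p>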
<h3>Factor loadings</h3>
<p>We can also check out the loadings of the principal component across the stocks. What this shows us is how a change in the component relates to the stocks in our data. For example, a component going up might cause half the stock returns in the data to go up and half to go down (it would positively load on some and negatively load on others). We would expect, intuitively, a factor representing “the market,” as we think our first component does, to load on our stocks in the same direction, and roughly the same magnitude. And this is basically what we see.</p>
<div class="highlight"><pre><span class="n">plt</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">arange</span><span class="p">(</span><span class="mi">24</span><span class="p">),</span> <span class="n">pca_fit</span><span class="o">.</span><span class="n">components_</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
</pre></div>
<p><a class="image" href="../images/pca_loadings.png">
<img src="../images/pca_loadings.png" width=400 />
</a></p>
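<p>To see why a “market” factor should load on everything in the same direction, here is a small simulation under a one-factor model. The betas, noise level and number of stocks are hypothetical choices of mine, not estimates from the chapter’s data.</p>

```python
import numpy as np

rng = np.random.default_rng(7)
n_obs, n_stocks = 2000, 8

# Every stock's return is a (hypothetical) beta times the market, plus noise.
market = rng.normal(size=(n_obs, 1))
betas = rng.uniform(0.8, 1.2, size=n_stocks)
returns = market * betas + 0.5 * rng.normal(size=(n_obs, n_stocks))

corr = np.corrcoef(returns, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
loadings = eigvecs[:, -1]  # eigenvector of the largest eigenvalue

# Fix the arbitrary sign so the loadings sum to a positive number.
loadings = loadings * np.sign(loadings.sum())
print(np.round(loadings, 3))
```

<p>When all the pairwise correlations are positive, the leading eigenvector has entries of a single sign, so the first component pushes every stock the same way.</p>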
<h2>Conclusion</h2>
<p>Since <span class="caps">PCA</span> is such a widely-used and fundamental technique, it’s important to know how to do it in Python, and the scikit-learn implementation is a good one. Check out the documentation <a href="http://scikit-learn.org/stable/modules/decomposition.html#pca">here</a>. Of course, like any statistical technique, <span class="caps">PCA</span> can definitely be misused, or at least easily misinterpreted, so handle with care.</p>I’ve seen the best minds of my generation destroyed by Matlab …2013-05-11T16:52:00-04:00Carltag:slendermeans.org,2013-05-11:julia-loops.html<p>(Note: this is very quick and not well thought out. Mostly a
conversation starter as opposed to any real thesis on the subject.)</p>
<p>This post is a continuation of a Twitter conversation <a href="https://twitter.com/johnmyleswhite/status/332920041626554369">here</a>, started
when John Myles White poked the hornets’ nest. (Python’s nest? Where do
Pythons live?)</p>
<p><img src="../images/jmw_tweet.jpg" width=450px /></p>
<p>The gist with John’s code is <a href="https://gist.github.com/johnmyleswhite/5556201">here</a>.</p>
<p>This isn’t a very thoughtful post. But the conversation was becoming
sort of a shootout and my thoughts (half-formed as they are) were a bit
longer than a tweet. Essentially, I think the Python performance
shootouts—PyPy, Numba, Cython—are missing the point.</p>
<p>The point is, I think, that loops are a crutch. A 3-nested for loop in
Julia that increments a counter takes 8 lines of code (1 initialize
counter, 3 for statements, 1 increment statement, 3 end statements).
Only one of those lines tells me what the code does.</p>
<p>But most scientific programmers learned to code in imperative languages
and that style of thinking and coding has become natural. I’ve often
seen comments like this:</p>
<p><img alt="forloop_tweet" src="../images/forloop_tweet.jpg" /></p>
<p>Which I think simply equates readability with familiarity. That isn’t
wrong, but it isn’t the whole story.</p>
<p>Anyway, a lot of the responses to John’s code were showing that, hey,
you can get fast loops in Python, with either JITing (PyPy, Numba) or
Cython. So here are my thoughts:</p>
<p>1. Cython is great. I’ve used it with great success myself. But Julia
gives me fast loops while keeping the dynamic typing; i.e., I’m still
writing in Julia. Cython is a manifestation of what the Julia developers
call the “two-language problem.” My programmer-productivity happens in
the slow, dynamic language, and I swap to a more painful language for
critical bottlenecks and glue the two together. Cython is a more
pleasant manifestation of the problem, especially since it lets you
evolve in an exploratory, piece-meal way from your first language to
your second language. But you still end up with code that is nice
dynamic-typing and abstractions on the outside; gross static typing and
low-level imperative stuff on the inside. (And Cython examples are often
clean and simple, but the code can get hairy very quickly.)</p>
<p>2. One of the nice things about the slow for loops in Python and R is
that they force you to think about other ways to express your problem. R
and Python programmers start thinking about how they can exploit arrays
and other ADTs, and higher-order functions to express their problem.
Avoiding the loop performance hit is the first reason, but then many of
them start to realize they like their code better this way. The
adjustment is hard at first, but once you get there, it’s hard to go back.</p>
<p>Forget about the Numpy, PyPy, Cython solutions to John’s problem. I
think it’s safe to say his original pure Python code would be considered
pretty un-Pythonic, to the extent that’s a thing. Python programmers are
discouraged from that style of writing-C-in-Python, for both performance
reasons, and conceptual reasons. Python programmers just think the
alternatives (e.g. list comprehensions) are more expressive and
maintainable. They’re not avoiding for loops because they’re slow: they
don’t <strong>want</strong> to write for loops.</p>
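<p>As a toy illustration of the stylistic difference (this is not John’s gist; the counting rule is something I made up), here is the same triple-index count written as nested loops and as a single generator expression.</p>

```python
from itertools import product

n = 10

# Imperative style: three nested loops incrementing a counter.
count_loop = 0
for i in range(n):
    for j in range(n):
        for k in range(n):
            if (i + j + k) % 2 == 0:
                count_loop += 1

# Declarative style: a generator expression over the same index space.
count_expr = sum(1 for i, j, k in product(range(n), repeat=3)
                 if (i + j + k) % 2 == 0)

print(count_loop, count_expr)
```

<p>Both give the same answer; the second version spends its one statement on saying what is being counted rather than on loop bookkeeping.</p>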
<p>Maybe Julia is the answer to this
problem. Since list comprehensions, higher-order-functions (applies,
maps, etc.) wrap imperative loops, and Julia loops are fast, then these
things can be written in Julia and be fast.</p>
<p>But that requires some thought about how
Julia devs want Julia programmers to program. Julia is great and
really promising, and it’s got an opportunity to let scientific
programmers really raise their game. But I’d hate the big pitch for
Julia to be: hey, you can write fast loops! And it would basically
become a refuge for people who never learned to properly code R and
are fed up with slow loops, or for Matlab guys whose licenses ran out.</p>Machine Learning for Hackers Chapter 7: Numerical optimization with deterministic and stochastic methods2013-02-12T18:51:00-05:00Carltag:slendermeans.org,2013-02-12:ml4h-ch7.html<h2>Introduction</h2>
<p>Chapter 7 of <em>Machine Learning for Hackers</em> is about numerical
optimization. The authors organize the chapter around two examples of
optimization. The first is a straightforward least-squares problem like
that we’ve encountered already doing linear regressions, and is amenable
to standard iterative algorithms (e.g. gradient descent). The second is
a problem with a discrete search space, not clearly differentiable, and
so lends itself to a stochastic/heuristic optimization technique (though
we’ll see the optimization problem is basically artificial). The first
problem gives us a chance to play around with Scipy’s optimization
routines. The second problem has us hand-coding a Metropolis algorithm;
this doesn’t show off much new Python, but it’s fun nonetheless.</p>
<p>The notebook for this chapter is at the GitHub repo <a href="https://github.com/carljv/Will_it_Python/tree/master/MLFH/ch7">here</a>, or you
can view it online via nbviewer <a href="http://nbviewer.ipython.org/urls/raw.github.com/carljv/Will_it_Python/master/MLFH/ch7/ch7.ipynb">here</a>.</p>
<h2>Ridge regression by least-squares</h2>
<p>In <a href="../ml4h-ch6.html">chapter 6</a> we estimated <span class="caps">LASSO</span> regressions, which added an L1
penalty on the parameters to the <span class="caps">OLS</span> loss-function. The ridge regression
works the same way, but applies an L2 penalty to the parameters. The
ridge regression is a somewhat more straightforward optimization
problem, since the L2 norm we use gives us a differentiable loss function.</p>
<p>In this example, we’ll regress weight on height, similar to <a href="http://nbviewer.ipython.org/urls/raw.github.com/carljv/Will_it_Python/master/MLFH/ch5/ch5.ipynb">chapter
5</a>. We can specify the loss (sum of squared errors) function for the
ridge regression with the following function in Python:</p>
<div class="highlight"><pre><span class="n">y</span> <span class="o">=</span> <span class="n">heights_weights</span><span class="p">[</span><span class="s">'Weight'</span><span class="p">]</span><span class="o">.</span><span class="n">values</span>
<span class="n">Xmat</span> <span class="o">=</span> <span class="n">sm</span><span class="o">.</span><span class="n">add_constant</span><span class="p">(</span><span class="n">heights_weights</span><span class="p">[</span><span class="s">'Height'</span><span class="p">],</span> <span class="n">prepend</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">ridge_error</span><span class="p">(</span><span class="n">params</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">Xmat</span><span class="p">,</span> <span class="n">lam</span><span class="p">):</span>
    <span class="sd">'''</span>
<span class="sd">    Compute SSE of the ridge regression.</span>
<span class="sd">    This is the normal regression SSE, plus the</span>
<span class="sd">    L2 cost of the parameters.</span>
<span class="sd">    '''</span>
    <span class="n">predicted</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">Xmat</span><span class="p">,</span> <span class="n">params</span><span class="p">)</span>
    <span class="n">sse</span> <span class="o">=</span> <span class="p">((</span><span class="n">y</span> <span class="o">-</span> <span class="n">predicted</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
    <span class="n">sse</span> <span class="o">+=</span> <span class="n">lam</span> <span class="o">*</span> <span class="p">(</span><span class="n">params</span> <span class="o">**</span> <span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
    <span class="k">return</span> <span class="n">sse</span>
</pre></div>
<p>The authors use R’s <code>optim</code> function, which defaults to the Nelder-Mead
simplex algorithm. This algorithm doesn’t use any gradient or Hessian
information to optimize the function. We’ll want to try out some
gradient methods, though. Even though the functions for these methods
will compute numerical gradients and Hessians for us, for the ridge
problem these are easy enough to specify explicitly.</p>
<div class="highlight"><pre><span class="k">def</span> <span class="nf">ridge_grad</span><span class="p">(</span><span class="n">params</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">Xmat</span><span class="p">,</span> <span class="n">lam</span><span class="p">):</span>
    <span class="sd">'''</span>
<span class="sd">    The gradient of the ridge regression SSE.</span>
<span class="sd">    '''</span>
    <span class="n">grad</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">Xmat</span><span class="o">.</span><span class="n">T</span><span class="p">,</span> <span class="n">Xmat</span><span class="p">),</span> <span class="n">params</span><span class="p">)</span> <span class="o">-</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">Xmat</span><span class="o">.</span><span class="n">T</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
    <span class="n">grad</span> <span class="o">+=</span> <span class="n">lam</span> <span class="o">*</span> <span class="n">params</span>
    <span class="n">grad</span> <span class="o">*=</span> <span class="mi">2</span>
    <span class="k">return</span> <span class="n">grad</span>
<span class="k">def</span> <span class="nf">ridge_hess</span><span class="p">(</span><span class="n">params</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">Xmat</span><span class="p">,</span> <span class="n">lam</span><span class="p">):</span>
    <span class="sd">'''</span>
<span class="sd">    The hessian of the ridge regression SSE.</span>
<span class="sd">    '''</span>
    <span class="c"># Differentiating the gradient keeps its factor of 2.</span>
    <span class="k">return</span> <span class="mi">2</span> <span class="o">*</span> <span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">Xmat</span><span class="o">.</span><span class="n">T</span><span class="p">,</span> <span class="n">Xmat</span><span class="p">)</span> <span class="o">+</span> <span class="n">np</span><span class="o">.</span><span class="n">eye</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span> <span class="o">*</span> <span class="n">lam</span><span class="p">)</span>
</pre></div>
<p>Like the <span class="caps">LASSO</span> regressions we worked with in <a title>chapter 6</a>, the
ridge requires a penalty parameter to weight the L2 cost of the
coefficient parameters (called <code>lam</code> in the functions above; <code>lambda</code> is
a keyword in Python). The authors assume we’ve already found an
appropriate value via cross-validation, and that value is 1.0.</p>
<p>We can now try to minimize the loss function with a couple of different
algorithms. First the Nelder-Mead simplex, which should correspond to
the authors’ use of <code>optim</code> in R.</p>
<div class="highlight"><pre><span class="c"># Starting values for the a, b (intercept, slope) parameters</span>
<span class="n">params0</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">])</span>
<span class="c"># Nelder-Mead simplex</span>
<span class="n">ridge_fit</span> <span class="o">=</span> <span class="n">opt</span><span class="o">.</span><span class="n">fmin</span><span class="p">(</span><span class="n">ridge_error</span><span class="p">,</span> <span class="n">params0</span><span class="p">,</span> <span class="n">args</span> <span class="o">=</span> <span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">Xmat</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
<span class="k">print</span> <span class="s">'Solution: a = </span><span class="si">%8.3f</span><span class="s">, b = </span><span class="si">%8.3f</span><span class="s"> '</span> <span class="o">%</span> <span class="nb">tuple</span><span class="p">(</span><span class="n">ridge_fit</span><span class="p">)</span>
<span class="n">Optimization</span> <span class="n">terminated</span> <span class="n">successfully</span><span class="o">.</span>
<span class="n">Current</span> <span class="n">function</span> <span class="n">value</span><span class="p">:</span> <span class="mf">1612442.197636</span>
<span class="n">Iterations</span><span class="p">:</span> <span class="mi">117</span>
<span class="n">Function</span> <span class="n">evaluations</span><span class="p">:</span> <span class="mi">221</span>
<span class="n">Solution</span><span class="p">:</span> <span class="n">a</span> <span class="o">=</span> <span class="o">-</span><span class="mf">340.565</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="mf">7.565</span>
</pre></div>
<p>Now the Newton conjugate-gradient method. We need to give this function
a gradient; the Hessian is optional. First without the Hessian:</p>
<div class="highlight"><pre><span class="n">ridge_fit</span> <span class="o">=</span> <span class="n">opt</span><span class="o">.</span><span class="n">fmin_ncg</span><span class="p">(</span><span class="n">ridge_error</span><span class="p">,</span> <span class="n">params0</span><span class="p">,</span> <span class="n">fprime</span> <span class="o">=</span> <span class="n">ridge_grad</span><span class="p">,</span>
<span class="n">args</span> <span class="o">=</span> <span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">Xmat</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
<span class="k">print</span> <span class="s">'Solution: a = </span><span class="si">%8.3f</span><span class="s">, b = </span><span class="si">%8.3f</span><span class="s"> '</span> <span class="o">%</span> <span class="nb">tuple</span><span class="p">(</span><span class="n">ridge_fit</span><span class="p">)</span>
<span class="n">Optimization</span> <span class="n">terminated</span> <span class="n">successfully</span><span class="o">.</span>
<span class="n">Current</span> <span class="n">function</span> <span class="n">value</span><span class="p">:</span> <span class="mf">1612442.197636</span>
<span class="n">Iterations</span><span class="p">:</span> <span class="mi">3</span>
<span class="n">Function</span> <span class="n">evaluations</span><span class="p">:</span> <span class="mi">4</span>
<span class="n">Gradient</span> <span class="n">evaluations</span><span class="p">:</span> <span class="mi">11</span>
<span class="n">Hessian</span> <span class="n">evaluations</span><span class="p">:</span> <span class="mi">0</span>
<span class="n">Solution</span><span class="p">:</span> <span class="n">a</span> <span class="o">=</span> <span class="o">-</span><span class="mf">340.565</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="mf">7.565</span>
</pre></div>
<p>Now supplying the Hessian:</p>
<div class="highlight"><pre><span class="n">ridge_fit</span> <span class="o">=</span> <span class="n">opt</span><span class="o">.</span><span class="n">fmin_ncg</span><span class="p">(</span><span class="n">ridge_error</span><span class="p">,</span> <span class="n">params0</span><span class="p">,</span> <span class="n">fprime</span> <span class="o">=</span> <span class="n">ridge_grad</span><span class="p">,</span>
                     <span class="n">fhess</span> <span class="o">=</span> <span class="n">ridge_hess</span><span class="p">,</span> <span class="n">args</span> <span class="o">=</span> <span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">Xmat</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
<span class="k">print</span> <span class="s">'Solution: a = </span><span class="si">%8.3f</span><span class="s">, b = </span><span class="si">%8.3f</span><span class="s"> '</span> <span class="o">%</span> <span class="nb">tuple</span><span class="p">(</span><span class="n">ridge_fit</span><span class="p">)</span>
<span class="n">Optimization</span> <span class="n">terminated</span> <span class="n">successfully</span><span class="o">.</span>
<span class="n">Current</span> <span class="n">function</span> <span class="n">value</span><span class="p">:</span> <span class="mf">1612442.197636</span>
<span class="n">Iterations</span><span class="p">:</span> <span class="mi">3</span>
<span class="n">Function</span> <span class="n">evaluations</span><span class="p">:</span> <span class="mi">7</span>
<span class="n">Gradient</span> <span class="n">evaluations</span><span class="p">:</span> <span class="mi">3</span>
<span class="n">Hessian</span> <span class="n">evaluations</span><span class="p">:</span> <span class="mi">3</span>
<span class="n">Solution</span><span class="p">:</span> <span class="n">a</span> <span class="o">=</span> <span class="o">-</span><span class="mf">340.565</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="mf">7.565</span>
</pre></div>
<p>Fortunately, we get the same results for all three methods. Supplying
the Hessian to the Newton method shaves some time off, but in this
simple application, it’s not really worth coding up a Hessian function
(except for fun).</p>
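<p>Because the ridge objective is quadratic in the parameters, there is also a closed-form answer to check the optimizers against: setting the gradient to zero gives the linear system (X'X + lam * I) * params = X'y. The following is only a minimal sketch of that check, in plain Python, for the two-parameter (intercept, slope) case. The data are made up, and the helper names <code>ridge_closed_form</code> and <code>ridge_grad_2d</code> are hypothetical; they are not part of the original analysis.</p>

```python
def ridge_closed_form(x, y, lam):
    """Solve (X'X + lam*I) params = X'y for X = [1, x] with Cramer's rule."""
    n = len(x)
    sx = sum(x)
    sxx = sum(xi * xi for xi in x)
    sy = sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    # Entries of the symmetric 2x2 matrix A = X'X + lam*I
    a11, a12, a22 = n + lam, sx, sxx + lam
    det = a11 * a22 - a12 * a12
    a = (sy * a22 - a12 * sxy) / det
    b = (a11 * sxy - a12 * sy) / det
    return a, b


def ridge_grad_2d(params, x, y, lam):
    """Gradient of sum((y - a - b*x)^2) + lam * (a^2 + b^2)."""
    a, b = params
    resid = [yi - a - b * xi for xi, yi in zip(x, y)]
    grad_a = -2.0 * sum(resid) + 2.0 * lam * a
    grad_b = -2.0 * sum(r * xi for r, xi in zip(resid, x)) + 2.0 * lam * b
    return grad_a, grad_b


# Made-up data (assumption): five points scattered around a line.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a_hat, b_hat = ridge_closed_form(x, y, lam=1.0)
grad_at_solution = ridge_grad_2d((a_hat, b_hat), x, y, lam=1.0)
print(a_hat, b_hat, grad_at_solution)
```

<p>If the gradient at the closed-form solution is zero up to rounding error, any optimizer that reports success should land on the same point.</p>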
<p>Lastly, the <span class="caps">BFGS</span> method, supplied with the gradient:</p>
<div class="highlight"><pre><span class="n">ridge_fit</span> <span class="o">=</span> <span class="n">opt</span><span class="o">.</span><span class="n">fmin_bfgs</span><span class="p">(</span><span class="n">ridge_error</span><span class="p">,</span> <span class="n">params0</span><span class="p">,</span> <span class="n">fprime</span> <span class="o">=</span> <span class="n">ridge_grad</span><span class="p">,</span>
                      <span class="n">args</span> <span class="o">=</span> <span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">Xmat</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
<span class="k">print</span> <span class="s">'Solution: a = </span><span class="si">%8.3f</span><span class="s">, b = </span><span class="si">%8.3f</span><span class="s"> '</span> <span class="o">%</span> <span class="nb">tuple</span><span class="p">(</span><span class="n">ridge_fit</span><span class="p">)</span>
<span class="n">Optimization</span> <span class="n">terminated</span> <span class="n">successfully</span><span class="o">.</span>
<span class="n">Current</span> <span class="n">function</span> <span class="n">value</span><span class="p">:</span> <span class="mf">1612442.197636</span>
<span class="n">Iterations</span><span class="p">:</span> <span class="mi">3</span>
<span class="n">Function</span> <span class="n">evaluations</span><span class="p">:</span> <span class="mi">7</span>
<span class="n">Gradient</span> <span class="n">evaluations</span><span class="p">:</span> <span class="mi">3</span>
<span class="n">Hessian</span> <span class="n">evaluations</span><span class="p">:</span> <span class="mi">3</span>
<span class="n">Solution</span><span class="p">:</span> <span class="n">a</span> <span class="o">=</span> <span class="o">-</span><span class="mf">340.565</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="mf">7.565</span>
</pre></div>
<p>For this simple problem, all of these methods work well. For more
complicated problems, there are considerations which would lead you to
prefer one over another, or perhaps to use them in combination. There
are also several more methods available, some of which allow you to solve
constrained optimization problems. Check out the very good
<a href="http://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html">documentation</a>. Also note that if you’re not into hand-coding
gradients, scipy has a function <code>derivative</code> in its <code>misc</code> module that
will compute numerical derivatives. In many cases, the optimizers will
approximate the gradient numerically on their own if you don’t pass a function to their gradient arguments.</p>
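<p>A numerical derivative is also a handy way to double-check a hand-coded gradient before handing it to an optimizer. Here is a rough sketch of the idea using central differences in plain Python. The toy objective <code>f</code>, its gradient <code>f_grad</code>, and the helper <code>numerical_grad</code> are all made up for illustration. SciPy ships <code>scipy.optimize.check_grad</code> for the same kind of comparison.</p>

```python
def numerical_grad(f, params, h=1e-6):
    """Central-difference approximation to the gradient of f at params."""
    grad = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad


# Toy objective (assumption): f(a, b) = (a - 3)^2 + 2 * (b + 1)^2
def f(p):
    return (p[0] - 3.0) ** 2 + 2.0 * (p[1] + 1.0) ** 2


def f_grad(p):
    """Hand-coded gradient of f: the thing we want to double-check."""
    return [2.0 * (p[0] - 3.0), 4.0 * (p[1] + 1.0)]


point = [0.5, 2.0]
approx = numerical_grad(f, point)
exact = f_grad(point)
max_abs_diff = max(abs(u - v) for u, v in zip(approx, exact))
print(approx, exact, max_abs_diff)
```

<p>A large discrepancy between the two usually means a sign or scaling mistake in the hand-coded gradient.</p>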
<h2>Optimizing on sentences with the Metropolis algorithm</h2>
<p>The second example in this chapter is a “code-breaking” exercise. We
start with a message “here is some sample text”, which we encrypt using
a Caesar cipher that shifts each letter in the message to the next
letter in the alphabet (with Z going to A). We can represent the cipher
(or any cipher) in Python with a dict that maps each letter to its
encrypted counterpart.</p>
<div class="highlight"><pre><span class="n">letters</span> <span class="o">=</span> <span class="p">[</span><span class="s">'a'</span><span class="p">,</span> <span class="s">'b'</span><span class="p">,</span> <span class="s">'c'</span><span class="p">,</span> <span class="s">'d'</span><span class="p">,</span> <span class="s">'e'</span><span class="p">,</span> <span class="s">'f'</span><span class="p">,</span> <span class="s">'g'</span><span class="p">,</span> <span class="s">'h'</span><span class="p">,</span>
<span class="s">'i'</span><span class="p">,</span> <span class="s">'j'</span><span class="p">,</span> <span class="s">'k'</span><span class="p">,</span> <span class="s">'l'</span><span class="p">,</span> <span class="s">'m'</span><span class="p">,</span> <span class="s">'n'</span><span class="p">,</span> <span class="s">'o'</span><span class="p">,</span> <span class="s">'p'</span><span class="p">,</span>
<span class="s">'q'</span><span class="p">,</span> <span class="s">'r'</span><span class="p">,</span> <span class="s">'s'</span><span class="p">,</span> <span class="s">'t'</span><span class="p">,</span> <span class="s">'u'</span><span class="p">,</span> <span class="s">'v'</span><span class="p">,</span> <span class="s">'w'</span><span class="p">,</span> <span class="s">'x'</span><span class="p">,</span>
<span class="s">'y'</span><span class="p">,</span> <span class="s">'z'</span><span class="p">]</span>
<span class="n">ceasar_cipher</span> <span class="o">=</span> <span class="p">{</span><span class="n">i</span><span class="p">:</span> <span class="n">j</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">letters</span><span class="p">,</span> <span class="n">letters</span><span class="p">[</span><span class="mi">1</span><span class="p">:]</span> <span class="o">+</span> <span class="n">letters</span><span class="p">[:</span><span class="mi">1</span><span class="p">])}</span>
<span class="n">inverse_ceasar_cipher</span> <span class="o">=</span> <span class="p">{</span><span class="n">ceasar_cipher</span><span class="p">[</span><span class="n">k</span><span class="p">]:</span> <span class="n">k</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">ceasar_cipher</span><span class="p">}</span>
</pre></div>
<p>The <code>inverse_ceasar_cipher</code> dict reverses the cipher, so we can get an
original message back from one that’s been encrypted by the Caesar
cipher. Based on these structures, let’s make functions that will
encrypt and decrypt text.</p>
<div class="highlight"><pre><span class="k">def</span> <span class="nf">cipher_text</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">cipher_dict</span> <span class="o">=</span> <span class="n">ceasar_cipher</span><span class="p">):</span>
    <span class="c"># Split the string into a list of characters to apply</span>
    <span class="c"># the decoder over.</span>
    <span class="n">strlist</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
    <span class="n">ciphered</span> <span class="o">=</span> <span class="s">''</span><span class="o">.</span><span class="n">join</span><span class="p">([</span><span class="n">cipher_dict</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="ow">or</span> <span class="n">x</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">strlist</span><span class="p">])</span>
    <span class="k">return</span> <span class="n">ciphered</span>

<span class="k">def</span> <span class="nf">decipher_text</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">cipher_dict</span> <span class="o">=</span> <span class="n">ceasar_cipher</span><span class="p">):</span>
    <span class="c"># Split the string into a list of characters to apply</span>
    <span class="c"># the decoder over.</span>
    <span class="n">strlist</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
    <span class="c"># Invert the cipher dictionary (k, v) -&gt; (v, k)</span>
    <span class="n">decipher_dict</span> <span class="o">=</span> <span class="p">{</span><span class="n">cipher_dict</span><span class="p">[</span><span class="n">k</span><span class="p">]:</span> <span class="n">k</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">cipher_dict</span><span class="p">}</span>
    <span class="n">deciphered</span> <span class="o">=</span> <span class="s">''</span><span class="o">.</span><span class="n">join</span><span class="p">([</span><span class="n">decipher_dict</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="ow">or</span> <span class="n">x</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">strlist</span><span class="p">])</span>
    <span class="k">return</span> <span class="n">deciphered</span>
</pre></div>
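<p>As a quick sanity check, here is a self-contained sketch of the round trip: encrypt the sample message with a shift-by-one cipher, then decrypt it with the inverted dict. The names <code>shift_cipher</code>, <code>inverse_cipher</code> and <code>apply_map</code> are stand-ins chosen for this example so it runs on its own; they mirror, but are not, the functions above.</p>

```python
import string

letters = list(string.ascii_lowercase)

# Shift-by-one cipher: a -> b, b -> c, ..., z -> a
shift_cipher = dict(zip(letters, letters[1:] + letters[:1]))
inverse_cipher = {v: k for k, v in shift_cipher.items()}


def apply_map(text, mapping):
    # Characters without an entry (spaces, punctuation) pass through as-is
    return ''.join(mapping.get(ch, ch) for ch in text)


message = 'here is some sample text'
encrypted = apply_map(message, shift_cipher)
decrypted = apply_map(encrypted, inverse_cipher)

print(encrypted)  # ifsf jt tpnf tbnqmf ufyu
print(decrypted)  # here is some sample text
```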
<p>To decrypt our message, we’ll design a Metropolis algorithm that
randomly proposes ciphers, decrypts the message according to the
proposed cipher, and sees how probable that message is based on a
lexical database of word frequency in Wikipedia.</p>
<p>The following functions are used to generate proposal ciphers for the
Metropolis algorithm. The idea is to randomly generate ciphers and see
what text they result in. If the text resulting from a proposed cipher
is more likely (according to the lexical database) than the current
cipher, we accept the proposal. If it’s not, we accept it with a
probability that is lower the less likely the resulting text is.</p>
<p>The method of generating new proposals is important. The authors use a
method that chooses a key (letter) at random from the current cipher,
and swaps it with some other letter. For example, if we start with the
Caesar cipher, our proposal might randomly choose to re-map A to N
(instead of B). The proposal would then be the same as the Caesar cipher,
but with A → N and M → B (since A originally mapped to B and M
originally mapped to N). This proposal-generating mechanism is
encapsulated in <code>propose_modified_cipher_from_cipher</code>.</p>
<p>This is inefficient in a few ways. First, the letter chosen to modify in
the cipher may not even appear in the text, so the proposed cipher won’t
modify the text at all and you end up wasting cycles generating a lot of
useless proposals. Second, we may end up picking a letter that occurs in
a highly likely word, which will increase the probability of generating
an inferior proposal.</p>
<p>We’ll suggest another mechanism that, instead of selecting a letter from
the current cipher to re-map, will choose a letter amongst the non-words
in the current deciphered text. For example, if our current deciphered
text is “hello wqrld”, we will only select amongst {w, q, r, l, d} to
modify at random. This minimizes the chances that a modified cipher will
turn real words into gibberish and produce less likely text. The
function <code>propose_modified_cipher_from_text</code> performs this proposal mechanism.</p>
<p>One way to think of this is that it’s analogous to tuning the variance
of the proposal distribution in the typical Metropolis algorithm. If the
variance is too low, our algorithm won’t efficiently explore the target
distribution. If it’s too high, we’ll end up generating lots of lousy
proposals. Our cipher proposal rules can suffer from similar problems.</p>
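<p>To make the analogy concrete, here is a small sketch of the textbook case: a random-walk Metropolis sampler targeting a standard normal, run with three different proposal standard deviations. Nothing here comes from the book; the function <code>acceptance_rate</code> and the chosen scales are assumptions for illustration only.</p>

```python
import math
import random


def acceptance_rate(proposal_sd, n_steps=20000, seed=0):
    """Random-walk Metropolis on a standard normal target.

    Returns the fraction of proposals that were accepted.
    """
    rng = random.Random(seed)
    x = 0.0
    accepted = 0
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, proposal_sd)
        # Log density ratio for N(0, 1): log p(proposal) - log p(x)
        log_ratio = 0.5 * (x * x - proposal * proposal)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
            accepted += 1
    return accepted / float(n_steps)


# Small, moderate and very large proposal scales
rates = {sd: acceptance_rate(sd) for sd in (0.1, 2.5, 50.0)}
print(rates)
```

<p>With the tiny scale nearly every proposal is accepted, but each step barely moves. With the huge scale almost everything is rejected. Neither extreme explores the target efficiently, which is the same trade-off the cipher proposal rules face.</p>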
<div class="highlight"><pre><span class="k">def</span> <span class="nf">generate_random_cipher</span><span class="p">():</span>
    <span class="sd">'''</span>
    <span class="sd">Randomly generate a cipher dictionary (a one-to-one letter -&gt; letter</span>
    <span class="sd">map).</span>
    <span class="sd">Used to generate the starting cipher of the algorithm.</span>
    <span class="sd">'''</span>
    <span class="n">cipher</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="nb">input</span> <span class="o">=</span> <span class="n">letters</span>
    <span class="n">output</span> <span class="o">=</span> <span class="n">letters</span><span class="p">[:]</span>
    <span class="n">random</span><span class="o">.</span><span class="n">shuffle</span><span class="p">(</span><span class="n">output</span><span class="p">)</span>
    <span class="n">cipher_dict</span> <span class="o">=</span> <span class="p">{</span><span class="n">k</span><span class="p">:</span> <span class="n">v</span> <span class="k">for</span> <span class="p">(</span><span class="n">k</span><span class="p">,</span> <span class="n">v</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="nb">input</span><span class="p">,</span> <span class="n">output</span><span class="p">)}</span>
    <span class="k">return</span> <span class="n">cipher_dict</span>

<span class="k">def</span> <span class="nf">modify_cipher</span><span class="p">(</span><span class="n">cipher_dict</span><span class="p">,</span> <span class="nb">input</span><span class="p">,</span> <span class="n">new_output</span><span class="p">):</span>
    <span class="sd">'''</span>
    <span class="sd">Swap a single key in a cipher dictionary.</span>
    <span class="sd">Old: a -&gt; b, ..., m -&gt; n, ...</span>
    <span class="sd">New: a -&gt; n, ..., m -&gt; b, ...</span>
    <span class="sd">'''</span>
    <span class="n">decipher_dict</span> <span class="o">=</span> <span class="p">{</span><span class="n">cipher_dict</span><span class="p">[</span><span class="n">k</span><span class="p">]:</span> <span class="n">k</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">cipher_dict</span><span class="p">}</span>
    <span class="n">old_output</span> <span class="o">=</span> <span class="n">cipher_dict</span><span class="p">[</span><span class="nb">input</span><span class="p">]</span>
    <span class="n">new_cipher</span> <span class="o">=</span> <span class="n">cipher_dict</span><span class="o">.</span><span class="n">copy</span><span class="p">()</span>
    <span class="n">new_cipher</span><span class="p">[</span><span class="nb">input</span><span class="p">]</span> <span class="o">=</span> <span class="n">new_output</span>
    <span class="n">new_cipher</span><span class="p">[</span><span class="n">decipher_dict</span><span class="p">[</span><span class="n">new_output</span><span class="p">]]</span> <span class="o">=</span> <span class="n">old_output</span>
    <span class="k">return</span> <span class="n">new_cipher</span>

<span class="k">def</span> <span class="nf">propose_modified_cipher_from_cipher</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">cipher_dict</span><span class="p">,</span>
                                        <span class="n">lexical_db</span> <span class="o">=</span> <span class="n">lexical_database</span><span class="p">):</span>
    <span class="sd">'''</span>
    <span class="sd">Generates a new cipher by choosing and swapping a key in the</span>
    <span class="sd">current cipher.</span>
    <span class="sd">'''</span>
    <span class="n">_</span> <span class="o">=</span> <span class="n">text</span> <span class="c"># Unused</span>
    <span class="nb">input</span> <span class="o">=</span> <span class="n">random</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="n">cipher_dict</span><span class="o">.</span><span class="n">keys</span><span class="p">(),</span> <span class="mi">1</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
    <span class="n">new_output</span> <span class="o">=</span> <span class="n">random</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="n">letters</span><span class="p">,</span> <span class="mi">1</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
    <span class="k">return</span> <span class="n">modify_cipher</span><span class="p">(</span><span class="n">cipher_dict</span><span class="p">,</span> <span class="nb">input</span><span class="p">,</span> <span class="n">new_output</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">propose_modified_cipher_from_text</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">cipher_dict</span><span class="p">,</span>
                                      <span class="n">lexical_db</span> <span class="o">=</span> <span class="n">lexical_database</span><span class="p">):</span>
    <span class="sd">'''</span>
    <span class="sd">Generates a new cipher by choosing and swapping a key in the current</span>
    <span class="sd">cipher, but only chooses keys that are letters that appear in the</span>
    <span class="sd">gibberish words in the current text.</span>
    <span class="sd">'''</span>
    <span class="n">deciphered</span> <span class="o">=</span> <span class="n">decipher_text</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">cipher_dict</span><span class="p">)</span><span class="o">.</span><span class="n">split</span><span class="p">()</span>
    <span class="n">letters_to_sample</span> <span class="o">=</span> <span class="s">''</span><span class="o">.</span><span class="n">join</span><span class="p">([</span><span class="n">t</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">deciphered</span>
                                 <span class="k">if</span> <span class="n">lexical_db</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">t</span><span class="p">)</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">])</span>
    <span class="n">letters_to_sample</span> <span class="o">=</span> <span class="n">letters_to_sample</span> <span class="ow">or</span> <span class="s">''</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="nb">set</span><span class="p">(</span><span class="n">deciphered</span><span class="p">))</span>
    <span class="nb">input</span> <span class="o">=</span> <span class="n">random</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="n">letters_to_sample</span><span class="p">,</span> <span class="mi">1</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
    <span class="n">new_output</span> <span class="o">=</span> <span class="n">random</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="n">letters</span><span class="p">,</span> <span class="mi">1</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
    <span class="k">return</span> <span class="n">modify_cipher</span><span class="p">(</span><span class="n">cipher_dict</span><span class="p">,</span> <span class="nb">input</span><span class="p">,</span> <span class="n">new_output</span><span class="p">)</span>
</pre></div>
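<p>The swap described earlier (A re-mapped to N, so M has to take over B) can be checked with a short standalone sketch. The helper <code>swap_output</code> is a stand-in written for this example; it follows the same logic as <code>modify_cipher</code> but is redefined here so the snippet runs by itself.</p>

```python
import string

letters = list(string.ascii_lowercase)
# Shift-by-one cipher: a -> b, ..., m -> n, ..., z -> a
shift_cipher = dict(zip(letters, letters[1:] + letters[:1]))


def swap_output(cipher, key, new_output):
    """Re-map `key` to `new_output`, and hand the old output of `key` to
    whichever key used to map to `new_output`, so the map stays one-to-one."""
    inverse = {v: k for k, v in cipher.items()}
    old_output = cipher[key]
    new_cipher = dict(cipher)
    new_cipher[key] = new_output
    new_cipher[inverse[new_output]] = old_output
    return new_cipher


proposal = swap_output(shift_cipher, 'a', 'n')
changed = sorted(k for k in letters if proposal[k] != shift_cipher[k])
print(changed, proposal['a'], proposal['m'])
```

<p>Only the two keys involved in the swap change, and every letter is still used exactly once as an output, so the proposal remains a valid cipher.</p>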
<p>Next, we need to be able to compute a message’s likelihood (from the
lexical database). The log-likelihood of a message is just the sum of
the log-likelihoods of each word (one-gram) in the message. If the word
is gibberish (i.e., doesn’t occur in the database) it gets a tiny
probability, set to the machine epsilon for floating-point numbers (<code>np.finfo(float).eps</code>).</p>
<div class="highlight"><pre><span class="k">def</span> <span class="nf">one_gram_prob</span><span class="p">(</span><span class="n">one_gram</span><span class="p">,</span> <span class="n">lexical_db</span> <span class="o">=</span> <span class="n">lexical_database</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">lexical_db</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">one_gram</span><span class="p">)</span> <span class="ow">or</span> <span class="n">np</span><span class="o">.</span><span class="n">finfo</span><span class="p">(</span><span class="nb">float</span><span class="p">)</span><span class="o">.</span><span class="n">eps</span>

<span class="k">def</span> <span class="nf">text_logp</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">cipher_dict</span><span class="p">,</span> <span class="n">lexical_db</span> <span class="o">=</span> <span class="n">lexical_database</span><span class="p">):</span>
    <span class="n">deciphered</span> <span class="o">=</span> <span class="n">decipher_text</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">cipher_dict</span><span class="p">)</span><span class="o">.</span><span class="n">split</span><span class="p">()</span>
    <span class="c"># Pass lexical_db through so a non-default database is actually used</span>
    <span class="n">logp</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="n">math</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="n">one_gram_prob</span><span class="p">(</span><span class="n">w</span><span class="p">,</span> <span class="n">lexical_db</span><span class="p">))</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span>
                     <span class="n">deciphered</span><span class="p">])</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
    <span class="k">return</span> <span class="n">logp</span>
</pre></div>
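<p>To get a feel for how heavily a single gibberish word is penalized, here is a toy version of the same calculation. The word probabilities in <code>toy_db</code> are invented for this example and have nothing to do with the real lexical database. <code>sys.float_info.epsilon</code> is used as the floor because it is the same value as <code>np.finfo(float).eps</code>.</p>

```python
import math
import sys

# Hypothetical word probabilities, made up for illustration only
toy_db = {'here': 0.002, 'is': 0.01, 'text': 0.0005}
FLOOR = sys.float_info.epsilon  # same value as np.finfo(float).eps


def toy_logp(message, db=toy_db):
    """Sum of log word probabilities; unknown words get the floor value."""
    return sum(math.log(db.get(w) or FLOOR) for w in message.split())


real = toy_logp('here is text')
gibberish = toy_logp('here is tfxt')
print(real, gibberish, real - gibberish)
```

<p>Swapping one real word for a non-word drops the log-likelihood by a large amount, which is what steers the sampler toward ciphers that produce real words.</p>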
<p>We can now use these functions in our Metropolis algorithm. Each step in
the Metropolis algorithm proposes a cipher, deciphers the text according
to the proposal, and computes the log-likelihood of the deciphered message.
If the likelihood of the deciphered message is better under the proposal
cipher than the current cipher, we definitely accept that proposal for
our next step. If not, we only accept the proposal with a probability
based on the relative likelihood of the proposal to the current cipher.</p>
<p>I’ll define this function to take an arbitrary proposal function via the
<code>proposal_rule</code> argument. So far, this can be one of the two
<code>propose_modified_cipher_from_*</code> functions defined above.</p>
<div class="highlight"><pre><span class="k">def</span> <span class="nf">metropolis_step</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">cipher_dict</span><span class="p">,</span> <span class="n">proposal_rule</span><span class="p">,</span> <span class="n">lexical_db</span> <span class="o">=</span>
                    <span class="n">lexical_database</span><span class="p">):</span>
    <span class="c"># Pass lexical_db through so a non-default database is actually used</span>
    <span class="n">proposed_cipher</span> <span class="o">=</span> <span class="n">proposal_rule</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">cipher_dict</span><span class="p">,</span> <span class="n">lexical_db</span><span class="p">)</span>
    <span class="n">lp1</span> <span class="o">=</span> <span class="n">text_logp</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">cipher_dict</span><span class="p">,</span> <span class="n">lexical_db</span><span class="p">)</span>
    <span class="n">lp2</span> <span class="o">=</span> <span class="n">text_logp</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">proposed_cipher</span><span class="p">,</span> <span class="n">lexical_db</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">lp2</span> <span class="o">&gt;</span> <span class="n">lp1</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">proposed_cipher</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">a</span> <span class="o">=</span> <span class="n">math</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="n">lp2</span> <span class="o">-</span> <span class="n">lp1</span><span class="p">)</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">random</span><span class="o">.</span><span class="n">random</span><span class="p">()</span>
        <span class="k">if</span> <span class="n">x</span> <span class="o">&lt;</span> <span class="n">a</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">proposed_cipher</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">cipher_dict</span>
</pre></div>
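<p>The accept/reject rule in the step function boils down to a single acceptance probability, min(1, exp(lp2 - lp1)). The sketch below isolates that piece and checks it by simulation. The log-likelihood values are made up, and <code>accept_probability</code> is a hypothetical helper rather than something defined elsewhere in this post.</p>

```python
import math
import random


def accept_probability(logp_current, logp_proposed):
    """Probability of accepting a proposal under the Metropolis rule."""
    if logp_proposed >= logp_current:
        # Better (or equal) proposals are always accepted
        return 1.0
    return math.exp(logp_proposed - logp_current)


# Made-up log-likelihoods (assumption): the proposal is 2 log units worse
p = accept_probability(-40.0, -42.0)

# Simulate the coin flip many times to confirm the empirical rate
rng = random.Random(1)
n_trials = 100000
n_accepted = sum(rng.random() < p for _ in range(n_trials))
empirical = n_accepted / float(n_trials)
print(p, empirical)
```

<p>Note that only the difference in log-likelihoods matters. A proposal that introduces one gibberish word is so much less likely that it will essentially never be accepted, while a proposal that trades one plausible word for a slightly less common one still has a real chance.</p>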
<p>To run the algorithm, just wrap the step function inside a loop. There’s
no stopping rule for the algorithm, so we have to choose a number of
iterations, and hope it’s enough to get us to the optimum. Let’s use 250,000.</p>
<div class="highlight"><pre><span class="n">message</span> <span class="o">=</span> <span class="s">'here is some sample text'</span>
<span class="n">ciphered_text</span> <span class="o">=</span> <span class="n">cipher_text</span><span class="p">(</span><span class="n">message</span><span class="p">,</span> <span class="n">ceasar_cipher</span><span class="p">)</span>
<span class="n">niter</span> <span class="o">=</span> <span class="mi">250000</span>
<span class="k">def</span> <span class="nf">metropolis_decipher</span><span class="p">(</span><span class="n">ciphered_text</span><span class="p">,</span> <span class="n">proposal_rule</span><span class="p">,</span> <span class="n">niter</span><span class="p">,</span> <span class="n">seed</span> <span class="o">=</span> <span class="mi">4</span><span class="p">):</span>
    <span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span>
    <span class="n">cipher</span> <span class="o">=</span> <span class="n">generate_random_cipher</span><span class="p">()</span>
    <span class="n">deciphered_text_list</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">logp_list</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="n">niter</span><span class="p">):</span>
        <span class="n">logp</span> <span class="o">=</span> <span class="n">text_logp</span><span class="p">(</span><span class="n">ciphered_text</span><span class="p">,</span> <span class="n">cipher</span><span class="p">)</span>
        <span class="n">current_deciphered_text</span> <span class="o">=</span> <span class="n">decipher_text</span><span class="p">(</span><span class="n">ciphered_text</span><span class="p">,</span> <span class="n">cipher</span><span class="p">)</span>
        <span class="n">deciphered_text_list</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">current_deciphered_text</span><span class="p">)</span>
        <span class="n">logp_list</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">logp</span><span class="p">)</span>
        <span class="n">cipher</span> <span class="o">=</span> <span class="n">metropolis_step</span><span class="p">(</span><span class="n">ciphered_text</span><span class="p">,</span> <span class="n">cipher</span><span class="p">,</span> <span class="n">proposal_rule</span><span class="p">)</span>
    <span class="n">results</span> <span class="o">=</span> <span class="n">DataFrame</span><span class="p">({</span><span class="s">'deciphered_text'</span><span class="p">:</span> <span class="n">deciphered_text_list</span><span class="p">,</span>
                         <span class="s">'logp'</span><span class="p">:</span> <span class="n">logp_list</span><span class="p">})</span>
    <span class="n">results</span><span class="o">.</span><span class="n">index</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">niter</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>
<span class="k">return</span> <span class="n">results</span>
</pre></div>
<p>First let’s look at the authors’ proposal rule. While they managed to get a reasonable decrypted message
in about 50,000 iterations, we’re still reading gibberish after 250,000.
As they say in the book, their results are an artefact of a lucky seed value.</p>
<div class="highlight"><pre><span class="n">results0</span> <span class="o">=</span> <span class="n">metropolis_decipher</span><span class="p">(</span><span class="n">ciphered_text</span><span class="p">,</span>
<span class="n">propose_modified_cipher_from_cipher</span><span class="p">,</span> <span class="n">niter</span><span class="p">)</span>
<span class="k">print</span> <span class="n">results0</span><span class="o">.</span><span class="n">ix</span><span class="p">[</span><span class="mi">10000</span><span class="p">::</span><span class="mi">10000</span><span class="p">]</span>
<span class="n">deciphered_text</span> <span class="n">logp</span>
<span class="mi">10000</span> <span class="n">kudu</span> <span class="n">of</span> <span class="n">feru</span> <span class="n">fyrvbu</span> <span class="n">hush</span> <span class="o">-</span><span class="mf">86.585205</span>
<span class="mi">20000</span> <span class="n">wudu</span> <span class="n">of</span> <span class="n">feru</span> <span class="n">fbrkxu</span> <span class="n">hush</span> <span class="o">-</span><span class="mf">87.124919</span>
<span class="mi">30000</span> <span class="n">kudu</span> <span class="n">of</span> <span class="n">feru</span> <span class="n">fnrbau</span> <span class="n">hush</span> <span class="o">-</span><span class="mf">86.585205</span>
<span class="mi">40000</span> <span class="n">wudu</span> <span class="n">of</span> <span class="n">feru</span> <span class="n">fmrjiu</span> <span class="n">hush</span> <span class="o">-</span><span class="mf">87.124919</span>
<span class="mi">50000</span> <span class="n">kudu</span> <span class="n">of</span> <span class="n">feru</span> <span class="n">fyrnbu</span> <span class="n">hush</span> <span class="o">-</span><span class="mf">86.585205</span>
<span class="mi">60000</span> <span class="n">kudu</span> <span class="n">of</span> <span class="n">feru</span> <span class="n">fxrnvu</span> <span class="n">hush</span> <span class="o">-</span><span class="mf">86.585205</span>
<span class="mi">70000</span> <span class="n">pudu</span> <span class="n">of</span> <span class="n">feru</span> <span class="n">fvrnlu</span> <span class="n">hush</span> <span class="o">-</span><span class="mf">87.561022</span>
<span class="mi">80000</span> <span class="n">kudu</span> <span class="n">of</span> <span class="n">feru</span> <span class="n">fvrxgu</span> <span class="n">hush</span> <span class="o">-</span><span class="mf">86.585205</span>
<span class="mi">90000</span> <span class="n">kudu</span> <span class="n">of</span> <span class="n">feru</span> <span class="n">fbrvtu</span> <span class="n">hush</span> <span class="o">-</span><span class="mf">86.585205</span>
<span class="mi">100000</span> <span class="n">kudu</span> <span class="n">of</span> <span class="n">feru</span> <span class="n">fjrnlu</span> <span class="n">hush</span> <span class="o">-</span><span class="mf">86.585205</span>
<span class="mi">110000</span> <span class="n">kudu</span> <span class="n">of</span> <span class="n">feru</span> <span class="n">fprbju</span> <span class="n">hush</span> <span class="o">-</span><span class="mf">86.585205</span>
<span class="mi">120000</span> <span class="n">kudu</span> <span class="n">of</span> <span class="n">feru</span> <span class="n">fnrjcu</span> <span class="n">hush</span> <span class="o">-</span><span class="mf">86.585205</span>
<span class="mi">130000</span> <span class="n">kudu</span> <span class="n">of</span> <span class="n">feru</span> <span class="n">flrvpu</span> <span class="n">hush</span> <span class="o">-</span><span class="mf">86.585205</span>
<span class="mi">140000</span> <span class="n">puku</span> <span class="n">of</span> <span class="n">feru</span> <span class="n">flrvxu</span> <span class="n">hush</span> <span class="o">-</span><span class="mf">88.028362</span>
<span class="mi">150000</span> <span class="n">kudu</span> <span class="n">of</span> <span class="n">feru</span> <span class="n">fxrviu</span> <span class="n">hush</span> <span class="o">-</span><span class="mf">86.585205</span>
<span class="mi">160000</span> <span class="n">pulu</span> <span class="n">of</span> <span class="n">feru</span> <span class="n">ftrdzu</span> <span class="n">hush</span> <span class="o">-</span><span class="mf">88.323162</span>
<span class="mi">170000</span> <span class="n">wuzu</span> <span class="n">of</span> <span class="n">feru</span> <span class="n">flrxdu</span> <span class="n">hush</span> <span class="o">-</span><span class="mf">89.575925</span>
<span class="mi">180000</span> <span class="n">kudu</span> <span class="n">of</span> <span class="n">feru</span> <span class="n">firamu</span> <span class="n">hush</span> <span class="o">-</span><span class="mf">86.585205</span>
<span class="mi">190000</span> <span class="n">wudu</span> <span class="n">of</span> <span class="n">feru</span> <span class="n">fyrzqu</span> <span class="n">hush</span> <span class="o">-</span><span class="mf">87.124919</span>
<span class="mi">200000</span> <span class="n">wudu</span> <span class="n">of</span> <span class="n">feru</span> <span class="n">fnraxu</span> <span class="n">hush</span> <span class="o">-</span><span class="mf">87.124919</span>
<span class="mi">210000</span> <span class="n">puku</span> <span class="n">of</span> <span class="n">feru</span> <span class="n">fjrnyu</span> <span class="n">hush</span> <span class="o">-</span><span class="mf">88.028362</span>
<span class="mi">220000</span> <span class="n">puku</span> <span class="n">of</span> <span class="n">feru</span> <span class="n">firyau</span> <span class="n">hush</span> <span class="o">-</span><span class="mf">88.028362</span>
<span class="mi">230000</span> <span class="n">pudu</span> <span class="n">of</span> <span class="n">feru</span> <span class="n">fkrcvu</span> <span class="n">hush</span> <span class="o">-</span><span class="mf">87.561022</span>
<span class="mi">240000</span> <span class="n">kudu</span> <span class="n">of</span> <span class="n">feru</span> <span class="n">ftrwzu</span> <span class="n">hush</span> <span class="o">-</span><span class="mf">86.585205</span>
<span class="mi">250000</span> <span class="n">kudu</span> <span class="n">of</span> <span class="n">feru</span> <span class="n">fprxzu</span> <span class="n">hush</span> <span class="o">-</span><span class="mf">86.585205</span>
</pre></div>
<p>Now, let’s try the alternative proposal rule, which only chooses letters
from gibberish words when it modifies the current cipher to propose a
new one. The algorithm doesn’t settle on the actual message, but within
20,000 iterations it finds one that is more likely according to the
lexical database.</p>
<div class="highlight"><pre><span class="n">results1</span> <span class="o">=</span> <span class="n">metropolis_decipher</span><span class="p">(</span><span class="n">ciphered_text</span><span class="p">,</span>
<span class="n">propose_modified_cipher_from_text</span><span class="p">,</span> <span class="n">niter</span><span class="p">)</span>
<span class="k">print</span> <span class="n">results1</span><span class="o">.</span><span class="n">ix</span><span class="p">[</span><span class="mi">10000</span><span class="p">::</span><span class="mi">10000</span><span class="p">]</span>
<span class="n">deciphered_text</span> <span class="n">logp</span>
<span class="mi">10000</span> <span class="n">were</span> <span class="n">mi</span> <span class="n">isle</span> <span class="n">izlkde</span> <span class="n">text</span> <span class="o">-</span><span class="mf">68.946850</span>
<span class="mi">20000</span> <span class="n">were</span> <span class="k">as</span> <span class="n">some</span> <span class="n">simple</span> <span class="n">text</span> <span class="o">-</span><span class="mf">35.784429</span>
<span class="mi">30000</span> <span class="n">were</span> <span class="k">as</span> <span class="n">some</span> <span class="n">simple</span> <span class="n">text</span> <span class="o">-</span><span class="mf">35.784429</span>
<span class="mi">40000</span> <span class="n">were</span> <span class="k">as</span> <span class="n">some</span> <span class="n">simple</span> <span class="n">text</span> <span class="o">-</span><span class="mf">35.784429</span>
<span class="mi">50000</span> <span class="n">were</span> <span class="k">as</span> <span class="n">some</span> <span class="n">simple</span> <span class="n">text</span> <span class="o">-</span><span class="mf">35.784429</span>
<span class="mi">60000</span> <span class="n">were</span> <span class="k">as</span> <span class="n">some</span> <span class="n">simple</span> <span class="n">text</span> <span class="o">-</span><span class="mf">35.784429</span>
<span class="mi">70000</span> <span class="n">were</span> <span class="n">us</span> <span class="n">some</span> <span class="n">simple</span> <span class="n">text</span> <span class="o">-</span><span class="mf">38.176725</span>
<span class="mi">80000</span> <span class="n">were</span> <span class="k">as</span> <span class="n">some</span> <span class="n">simple</span> <span class="n">text</span> <span class="o">-</span><span class="mf">35.784429</span>
<span class="mi">90000</span> <span class="n">were</span> <span class="k">as</span> <span class="n">some</span> <span class="n">simple</span> <span class="n">text</span> <span class="o">-</span><span class="mf">35.784429</span>
<span class="mi">100000</span> <span class="n">were</span> <span class="k">as</span> <span class="n">some</span> <span class="n">simple</span> <span class="n">text</span> <span class="o">-</span><span class="mf">35.784429</span>
<span class="mi">110000</span> <span class="n">were</span> <span class="k">as</span> <span class="n">some</span> <span class="n">simple</span> <span class="n">text</span> <span class="o">-</span><span class="mf">35.784429</span>
<span class="mi">120000</span> <span class="n">were</span> <span class="k">as</span> <span class="n">some</span> <span class="n">simple</span> <span class="n">text</span> <span class="o">-</span><span class="mf">35.784429</span>
<span class="mi">130000</span> <span class="n">were</span> <span class="k">as</span> <span class="n">some</span> <span class="n">simple</span> <span class="n">text</span> <span class="o">-</span><span class="mf">35.784429</span>
<span class="mi">140000</span> <span class="n">were</span> <span class="k">as</span> <span class="n">some</span> <span class="n">simple</span> <span class="n">text</span> <span class="o">-</span><span class="mf">35.784429</span>
<span class="mi">150000</span> <span class="n">were</span> <span class="n">us</span> <span class="n">some</span> <span class="n">simple</span> <span class="n">text</span> <span class="o">-</span><span class="mf">38.176725</span>
<span class="mi">160000</span> <span class="n">were</span> <span class="k">as</span> <span class="n">some</span> <span class="n">simple</span> <span class="n">text</span> <span class="o">-</span><span class="mf">35.784429</span>
<span class="mi">170000</span> <span class="n">were</span> <span class="ow">is</span> <span class="n">some</span> <span class="n">sample</span> <span class="n">text</span> <span class="o">-</span><span class="mf">37.012894</span>
<span class="mi">180000</span> <span class="n">were</span> <span class="k">as</span> <span class="n">some</span> <span class="n">simple</span> <span class="n">text</span> <span class="o">-</span><span class="mf">35.784429</span>
<span class="mi">190000</span> <span class="n">were</span> <span class="k">as</span> <span class="n">some</span> <span class="n">simple</span> <span class="n">text</span> <span class="o">-</span><span class="mf">35.784429</span>
<span class="mi">200000</span> <span class="n">were</span> <span class="k">as</span> <span class="n">some</span> <span class="n">simple</span> <span class="n">text</span> <span class="o">-</span><span class="mf">35.784429</span>
<span class="mi">210000</span> <span class="n">were</span> <span class="k">as</span> <span class="n">some</span> <span class="n">simple</span> <span class="n">text</span> <span class="o">-</span><span class="mf">35.784429</span>
<span class="mi">220000</span> <span class="n">were</span> <span class="k">as</span> <span class="n">some</span> <span class="n">simple</span> <span class="n">text</span> <span class="o">-</span><span class="mf">35.784429</span>
<span class="mi">230000</span> <span class="n">were</span> <span class="k">as</span> <span class="n">some</span> <span class="n">simple</span> <span class="n">text</span> <span class="o">-</span><span class="mf">35.784429</span>
<span class="mi">240000</span> <span class="n">were</span> <span class="k">as</span> <span class="n">some</span> <span class="n">simple</span> <span class="n">text</span> <span class="o">-</span><span class="mf">35.784429</span>
<span class="mi">250000</span> <span class="n">were</span> <span class="ow">is</span> <span class="n">some</span> <span class="n">sample</span> <span class="n">text</span> <span class="o">-</span><span class="mf">37.012894</span>
</pre></div>
<p>The graph below plots the likelihood paths of the algorithm for the two
proposal rules. The blue line is the log-likelihood of the original
message we’re trying to recover.</p>
<p><a href="../images/metropolis_likpaths.png">
<img src="../images/metropolis_likpaths.png" width=400px />
</a></p>
<h2>Direct calculation of the most likely message</h2>
<p>The Metropolis algorithm is kind of pointless for this application. It’s
really just jumping around looking for the most likely phrase. But since
the log-likelihood of a message is just the sum of the log probabilities
of its component words, we just need to look for the most likely word of
each length that appears in the ciphered message.</p>
<p>If the message at some point is “fgk tp hpdt”, then, if run long enough,
the algorithm should just find the most likely three-letter word, the
most likely two-letter word, and the most likely four-letter word. But
we can look these up directly.</p>
<p>For example, the message we encrypted is ‘here is some sample text’,
which has word lengths 4, 2, 4, 6, 4. What’s the most likely message
with these word lengths?</p>
<div class="highlight"><pre><span class="k">def</span> <span class="nf">maxprob_message</span><span class="p">(</span><span class="n">word_lens</span> <span class="o">=</span> <span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">4</span><span class="p">),</span> <span class="n">lexical_db</span> <span class="o">=</span>
<span class="n">lexical_database</span><span class="p">):</span>
<span class="n">db_word_series</span> <span class="o">=</span> <span class="n">Series</span><span class="p">(</span><span class="n">lexical_db</span><span class="o">.</span><span class="n">index</span><span class="p">)</span>
<span class="n">db_word_len</span> <span class="o">=</span> <span class="n">db_word_series</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">len</span><span class="p">()</span>
<span class="n">max_prob_wordlist</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">logp</span> <span class="o">=</span> <span class="mf">0.0</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">word_lens</span><span class="p">:</span>
<span class="n">db_words_i</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">db_word_series</span><span class="p">[</span><span class="n">db_word_len</span> <span class="o">==</span> <span class="n">i</span><span class="p">])</span>
<span class="n">db_max_prob_word</span> <span class="o">=</span> <span class="n">lexical_db</span><span class="p">[</span><span class="n">db_words_i</span><span class="p">]</span><span class="o">.</span><span class="n">idxmax</span><span class="p">()</span>
<span class="n">logp</span> <span class="o">+=</span> <span class="n">math</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="n">lexical_db</span><span class="p">[</span><span class="n">db_words_i</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">())</span>
<span class="n">max_prob_wordlist</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">db_max_prob_word</span><span class="p">)</span>
<span class="k">return</span> <span class="n">max_prob_wordlist</span><span class="p">,</span> <span class="n">logp</span>
<span class="n">maxprob_message</span><span class="p">()</span>
<span class="p">([</span><span class="s">'with'</span><span class="p">,</span> <span class="s">'of'</span><span class="p">,</span> <span class="s">'with'</span><span class="p">,</span> <span class="s">'united'</span><span class="p">,</span> <span class="s">'with'</span><span class="p">],</span> <span class="o">-</span><span class="mf">25.642396806584493</span><span class="p">)</span>
</pre></div>
<p>So, technically, we should have decoded our message to be “with of
with united with” instead of “here is some sample text”. This is not a
shining endorsement of this methodology for decrypting messages.</p>
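<p>The same lookup can be sketched without pandas. The word probabilities below are made up for illustration (they are not the book’s lexical database), and <code>maxprob_message_toy</code> is a hypothetical name; the point is only that the “best” message is built one word length at a time, with no regard for whether the words make sense together.</p>

```python
import math

# Hypothetical toy word probabilities, NOT the book's lexical database.
toy_db = {
    'of': 0.05, 'is': 0.02,
    'with': 0.03, 'here': 0.001, 'some': 0.004, 'text': 0.0005,
    'united': 0.002, 'sample': 0.0003,
}


def maxprob_message_toy(word_lens, db):
    """Pick the most probable word of each length and sum the log probs."""
    words = []
    logp = 0.0
    for n in word_lens:
        # All database words with the required number of letters.
        candidates = [w for w in db if len(w) == n]
        best = max(candidates, key=lambda w: db[w])
        words.append(best)
        logp += math.log(db[best])
    return words, logp


words, logp = maxprob_message_toy((4, 2, 4, 6, 4), toy_db)
print(' '.join(words))
print(logp)
```

<p>With these toy numbers the most probable four-letter word is reused for every four-letter slot, which is the same behaviour seen in the real output above.</p>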
<h2>Conclusion</h2>
<p>While it was a fun exercise to code up the Metropolis decrypter in this
chapter, it didn’t show off any new Python functionality. The ridge
problem, while less interesting, showed off some of the optimization
algorithms in Scipy. There’s a lot of good stuff in Scipy’s <code>optimize</code>
module, and its <a href="http://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html">documentation</a> is worth checking out.</p>Machine Learning for Hackers Chapter 6: Regression models with regularization2013-02-08T20:07:00-05:00Carltag:slendermeans.org,2013-02-08:ml4h-ch6.html<p>In my opinion, Chapter 6 is the most important chapter in <em>Machine
Learning for Hackers</em>. It introduces the fundamental problem of machine
learning: overfitting and the bias-variance tradeoff. And it
demonstrates the two key tools for dealing with it: regularization and cross-validation.</p>
<p>It’s also a fun chapter to write in Python, because it lets me play with
the fantastic <a href="http://scikit-learn.org/stable/">scikit-learn</a> library. scikit-learn is loaded with
hi-tech machine learning models, along with convenient “pipeline”-type
functions that facilitate the process of cross-validating and selecting
hyperparameters for models. Best of all, it’s <a href="http://scikit-learn.org/stable/">very well
documented</a>.</p>
<h2>Fitting a sine wave with polynomial regression</h2>
<p>The chapter starts out with a useful toy example—trying to fit a curve
to data generated by a sine function over the interval [0, 1] with added
Gaussian noise. The natural way to fit nonlinear data like this is using
a polynomial function, so that the output, <em>y</em>, is a function of powers
of the input <em>x</em>. But there are two problems with this.</p>
<p>First, we can generate highly correlated regressors by taking powers of
<em>x</em>, leading to noisy parameter estimates. The inputs <em>x</em> are evenly
spaced numbers on the interval [0, 1], so <em>x</em> and <em>x<sup>2</sup></em> are going to
have a correlation over 95%, and similarly for <em>x<sup>2</sup></em> and <em>x<sup>3</sup></em>. The
solution to this is to use <em>orthogonalized</em> polynomial functions:
transformations of <em>x</em> that, when summed, result in polynomial functions,
but are orthogonal (and therefore uncorrelated) with each other.</p>
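<p>Both claims are easy to check numerically. The sketch below uses only the standard library: it computes the correlations between raw powers of <em>x</em> on an evenly spaced grid, then orthonormalizes the columns 1, <em>x</em>, <em>x<sup>2</sup></em>, <em>x<sup>3</sup></em> with a Gram-Schmidt pass. This illustrates the idea only; it is not a reproduction of how patsy builds its polynomial transform, and the helper names are my own.</p>

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def correlation(u, v):
    """Pearson correlation of two equal-length sequences."""
    mean_u = sum(u) / float(len(u))
    mean_v = sum(v) / float(len(v))
    du = [a - mean_u for a in u]
    dv = [b - mean_v for b in v]
    return dot(du, dv) / math.sqrt(dot(du, du) * dot(dv, dv))


def orthonormalize(columns):
    """Gram-Schmidt: return orthonormal columns spanning the same space."""
    basis = []
    for column in columns:
        v = list(column)
        for b in basis:
            # Remove the component of v that lies along b.
            weight = dot(v, b)
            v = [vi - weight * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(dot(v, v))
        basis.append([vi / norm for vi in v])
    return basis


# 101 evenly spaced points on [0, 1], as in the chapter's toy data.
x = [i / 100.0 for i in range(101)]
x2 = [a ** 2 for a in x]
x3 = [a ** 3 for a in x]

corr_x_x2 = correlation(x, x2)
corr_x2_x3 = correlation(x2, x3)
print('corr(x, x^2)   = %.3f' % corr_x_x2)
print('corr(x^2, x^3) = %.3f' % corr_x2_x3)

# Orthonormalize the constant, linear, quadratic and cubic columns.
q0, q1, q2, q3 = orthonormalize([[1.0] * len(x), x, x2, x3])
print('dot(q1, q2) = %.2e' % dot(q1, q2))
```

<p>Because every column after the first is orthogonal to the constant column, it has mean zero, so zero dot products between the transformed columns also mean zero correlation.</p>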
<p>Luckily, we can easily calculate these transformations using patsy. The
<code>C(x, Poly)</code> transform computes orthonormal polynomial functions of <em>x</em>,
from which we’ll extract various orders of the polynomial. So
<code>Xpoly[:, :2]</code> selects the 0th- and 1st-order functions, which when
summed give us a first-order polynomial (i.e. linear). Similarly,
<code>Xpoly[:, :4]</code> gives us the 0th- through 3rd-order functions, which sum
to a cubic polynomial.</p>
<div class="highlight"><pre><span class="n">sin_data</span> <span class="o">=</span> <span class="n">DataFrame</span><span class="p">({</span><span class="s">'x'</span> <span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">101</span><span class="p">)})</span>
<span class="n">sin_data</span><span class="p">[</span><span class="s">'y'</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">sin</span><span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="n">pi</span> <span class="o">*</span> <span class="n">sin_data</span><span class="p">[</span><span class="s">'x'</span><span class="p">])</span> <span class="o">+</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">normal</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span> <span class="mi">101</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">sin_data</span><span class="p">[</span><span class="s">'x'</span><span class="p">]</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">sin_data</span><span class="p">[</span><span class="s">'y'</span><span class="p">]</span>
<span class="n">Xpoly</span> <span class="o">=</span> <span class="n">dmatrix</span><span class="p">(</span><span class="s">'C(x, Poly)'</span><span class="p">)</span>
<span class="n">Xpoly1</span> <span class="o">=</span> <span class="n">Xpoly</span><span class="p">[:,</span> <span class="p">:</span><span class="mi">2</span><span class="p">]</span>
<span class="n">Xpoly3</span> <span class="o">=</span> <span class="n">Xpoly</span><span class="p">[:,</span> <span class="p">:</span><span class="mi">4</span><span class="p">]</span>
<span class="n">Xpoly5</span> <span class="o">=</span> <span class="n">Xpoly</span><span class="p">[:,</span> <span class="p">:</span><span class="mi">6</span><span class="p">]</span>
<span class="n">Xpoly25</span> <span class="o">=</span> <span class="n">Xpoly</span><span class="p">[:,</span> <span class="p">:</span><span class="mi">26</span><span class="p">]</span>
</pre></div>
<p>The problem we encounter now is how to choose what order polynomial to
fit to the data. Any data can be fit well (i.e. have a high R<sup>2</sup>) if we
use a high enough order polynomial. But we will start to over-fit our
data, capturing noise specific to our sample and leading to poor
predictions on new data. The graph below shows the fits to the data of a
straight line, a 3rd-order polynomial, a 5th-order polynomial, and a
25th-order polynomial. Notice how the last fit gives us all kinds of
degrees of freedom to capture specific datapoints, and the excessive
“wiggles” look like we’re fitting to noise.</p>
<p><a href="../images/sine_wave_polyfits.png">
<img src="../images/sine_wave_polyfits.png" width = 450px />
</a></p>
<p>In machine learning, this problem is solved with
<em>regularization</em>—penalizing large parameter estimates in a way that,
hopefully, shrinks down the coefficients on all but the most important
inputs. Here’s where scikit-learn shines.</p>
<h2>Preventing overfitting with regularization</h2>
<p>The penalty parameter in a regularized regression is typically found via
cross-validation; for each candidate penalty one repeatedly fits the
model on subsets of the data, and the penalty value that gives the best
fit across the cross-validation “folds” is chosen. In the book, the
authors hand-code up a cross-validation scheme, looping over possible
penalties and subsets of the data and recording the MSEs.</p>
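<p>A hand-coded scheme of that kind can be sketched in a few lines. To keep it self-contained, the example below swaps in a one-variable ridge regression without an intercept (which has a closed-form solution) in place of the <span class="caps">LASSO</span>, and uses simulated data; the function names, the data and the penalty grid are all assumptions made for illustration.</p>

```python
import random


def ridge_slope(xs, ys, penalty):
    """Closed-form ridge estimate for y = beta * x (no intercept)."""
    sxy = sum(a * b for a, b in zip(xs, ys))
    sxx = sum(a * a for a in xs)
    return sxy / (sxx + penalty)


def cv_mse(xs, ys, penalty, n_folds=5):
    """Average held-out MSE across folds for one candidate penalty."""
    n = len(xs)
    total = 0.0
    for k in range(n_folds):
        # Every n_folds-th observation, starting at k, is held out.
        test_idx = set(range(k, n, n_folds))
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        beta = ridge_slope(train_x, train_y, penalty)
        errors = [(ys[i] - beta * xs[i]) ** 2 for i in test_idx]
        total += sum(errors) / len(errors)
    return total / n_folds


# Simulated data: true slope 2, Gaussian noise.
rng = random.Random(1)
xs = [rng.uniform(-1, 1) for _ in range(60)]
ys = [2.0 * a + rng.gauss(0, 0.5) for a in xs]

# Loop over candidate penalties and record the cross-validated MSE.
penalties = [0.0, 0.1, 1.0, 10.0, 100.0]
scores = dict((p, cv_mse(xs, ys, p)) for p in penalties)
best_penalty = min(scores, key=scores.get)
print(best_penalty)
```

<p>The structure is the same for any penalized model: an outer loop over penalties, an inner loop over folds, and a table of held-out errors from which the minimum is picked.</p>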
<p>In scikit-learn you can usually automate the cross-validation procedure
in one of a couple of ways. Many models have a <code>CV</code> version, or, if not,
you can wrap your model in <a href="http://scikit-learn.org/stable/modules/grid_search.html"><code>GridSearchCV</code></a>, which is a
convenience wrapper around all the looping and fit-recording entailed
in a cross-validation. Here I’ll use <a href="http://scikit-learn.org/stable/modules/linear_model.html"><code>LassoCV</code></a>, which
performs cross-validation for a <span class="caps">LASSO</span>-penalized linear regression.</p>
<div class="highlight"><pre><span class="n">lasso_model</span> <span class="o">=</span> <span class="n">LassoCV</span><span class="p">(</span><span class="n">cv</span> <span class="o">=</span> <span class="mi">15</span><span class="p">,</span> <span class="n">copy_X</span> <span class="o">=</span> <span class="bp">True</span><span class="p">,</span> <span class="n">normalize</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
<span class="n">lasso_fit</span> <span class="o">=</span> <span class="n">lasso_model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">Xpoly</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">:</span><span class="mi">11</span><span class="p">],</span> <span class="n">y</span><span class="p">)</span>
<span class="n">lasso_path</span> <span class="o">=</span> <span class="n">lasso_model</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">Xpoly</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">:</span><span class="mi">11</span><span class="p">],</span> <span class="n">y</span><span class="p">)</span>
</pre></div>
<p>The first line sets up the model by specifying some options. The only
interesting one here is <code>cv</code>, which specifies how many cross-validation
folds to run on each penalty-parameter value. The second line fits the
model: here I’m going to run a 10th-order polynomial regression, and
let the <span class="caps">LASSO</span> penalty shrink away all but the most important orders.
Finally, <code>lasso_path</code> stores the result of <code>score()</code>, which for
scikit-learn regressors is the R<sup>2</sup> of the fitted model on the data
passed in. The per-fold errors that the cross-validation uses to choose
the penalty (the <span class="caps">MSE</span>s) are kept in <code>mse_path_</code>.</p>
<p>After running the <code>fit()</code> method, <code>LassoCV</code> will provide useful output
attributes, including the “optimal” penalty parameter, stored in
<code>.alpha_</code>. Note that scikit-learn refers to the penalty parameter as
<code>alpha</code>, while R’s <code>glmnet</code>, which the authors use to implement the
<span class="caps">LASSO</span> model, calls it <code>lambda</code>. I’m more accustomed to the penalty
parameter being denoted with lambda myself. Note also that <code>glmnet</code> uses
<code>alpha</code> elsewhere.</p>
<div class="highlight"><pre><span class="c"># Plot the average RMSE across folds</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="o">-</span><span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="n">lasso_fit</span><span class="o">.</span><span class="n">alphas_</span><span class="p">),</span>
<span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">lasso_fit</span><span class="o">.</span><span class="n">mse_path_</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'RMSE (avg. across folds)'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">r'$-\log(\lambda)$'</span><span class="p">)</span>
<span class="c"># Indicate the lasso parameter that minimizes the average MSE across folds.</span>
<span class="n">plt</span><span class="o">.</span><span class="n">axvline</span><span class="p">(</span><span class="o">-</span><span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="n">lasso_fit</span><span class="o">.</span><span class="n">alpha_</span><span class="p">),</span> <span class="n">color</span> <span class="o">=</span> <span class="s">'red'</span><span class="p">)</span>
</pre></div>
<p><a href="../images/lasso_cv_fits_poly.png">
<img src="../images/lasso_cv_fits_poly.png" width = 450px />
</a></p>
<p>The value of the penalty parameter itself isn’t all that meaningful. So
let’s take a look at what the resulting coefficient estimates are when
we apply the penalty.</p>
<div class="highlight"><pre><span class="k">print</span> <span class="s">'Deg. Coefficient'</span>
<span class="k">print</span> <span class="n">Series</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">r_</span><span class="p">[</span><span class="n">lasso_fit</span><span class="o">.</span><span class="n">intercept_</span><span class="p">,</span> <span class="n">lasso_fit</span><span class="o">.</span><span class="n">coef_</span><span class="p">])</span>
<span class="n">Deg</span><span class="o">.</span> <span class="n">Coefficient</span>
<span class="mi">0</span> <span class="o">-</span><span class="mf">0.003584</span>
<span class="mi">1</span> <span class="o">-</span><span class="mf">5.359452</span>
<span class="mi">2</span> <span class="mf">0.000000</span>
<span class="mi">3</span> <span class="mf">4.689958</span>
<span class="mi">4</span> <span class="o">-</span><span class="mf">0.000000</span>
<span class="mi">5</span> <span class="o">-</span><span class="mf">0.547131</span>
<span class="mi">6</span> <span class="o">-</span><span class="mf">0.047675</span>
<span class="mi">7</span> <span class="mf">0.124998</span>
<span class="mi">8</span> <span class="mf">0.133224</span>
<span class="mi">9</span> <span class="o">-</span><span class="mf">0.171974</span>
<span class="mi">10</span> <span class="mf">0.090685</span>
</pre></div>
<p>So the <span class="caps">LASSO</span>, after selecting a penalty parameter via cross-validation,
results in essentially a 3rd-order polynomial model: <em>y = -5.4x +
4.7x<sup>3</sup></em>. This makes sense since, as we saw above, we’d captured the
important features of the data by the time we’d fit a 3rd-order polynomial.</p>
<h2>Predicting O’Reilly book sales using back-cover descriptions</h2>
<p>Next I’ll use the same model to tackle some real data. We have the sales
ranks of the top-100 selling O’Reilly books. We’d like to see if we can use
the text on the back-cover description of the book to predict its rank.
So the output variable is the rank of the book (reversed so that 100 is
the top-selling book, and 1 is the 100th best-selling book), while the
input variables are all the terms that appear in these 100 books’ back
covers. For each book the value of an input variable is the number of
times the term appears on its back cover. Many of the input values will
be zero (for example, the term “javascript” will occur many times in a
book about javascript, but zero times in every other book).</p>
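<p>To make the input matrix concrete, here’s a minimal sketch that builds term counts for a made-up three-document corpus using only the standard library. The <code>descriptions</code> list and the <code>tokenize</code> helper are illustrative assumptions, not the O’Reilly data or the code from the notebook.</p>
<div class="highlight"><pre>import re
from collections import Counter

# Hypothetical back-cover snippets (not the real data).
descriptions = [
    "learning javascript with javascript examples",
    "python for data analysis",
    "data science with python and javascript",
]

def tokenize(text):
    """Lower-case the text and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())

# One Counter of term frequencies per document.
counts = [Counter(tokenize(d)) for d in descriptions]

# Column order: every distinct term in the corpus, sorted.
vocabulary = sorted(set(term for c in counts for term in c))

# One row per document, one column per term; a Counter returns 0
# for terms that never appear in that document.
matrix = [[c[term] for term in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
</pre></div>
<p>Each row is a document and each column is a term. Even in this tiny example many entries are zero, and the “javascript” column is non-zero only for the documents that mention it.</p>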
<p>So the matrix of input variables is just our old friend, the
term-document matrix. After creating this (using any of the methods described
in the posts for chapter 3 or chapter 4), we can just apply
<code>LassoCV</code> again.</p>
<div class="highlight"><pre><span class="n">lasso_model</span> <span class="o">=</span> <span class="n">LassoCV</span><span class="p">(</span><span class="n">cv</span> <span class="o">=</span> <span class="mi">10</span><span class="p">)</span>
<span class="n">lasso_fit</span> <span class="o">=</span> <span class="n">lasso_model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">desc_tdm</span><span class="o">.</span><span class="n">values</span><span class="p">,</span> <span class="n">ranks</span><span class="o">.</span><span class="n">values</span><span class="p">)</span>
</pre></div>
<p>Because of the size and nature of the input data, this runs pretty
slowly (about 3-5 minutes for me). And, because there seems to be no
good prediction model to be had here, the model doesn’t always converge.
If we do get a convergent run, we find the <span class="caps">CV</span> procedure wants us to
shrink all the coefficients to zero: no input is worth keeping per the
<span class="caps">LASSO</span>. (Note that since the x-axis in the graph is -log(penalty), moving
left on the axis, towards 0, means more regularization.) This is the
same result the authors find.</p>
<p><a href="../images/lasso_cv_fits_text.png">
<img src="../images/lasso_cv_fits_text.png" width = 450px />
</a></p>
<h1>Logistic regression with cross-validation</h1>
<p>With the previous model a bust, the authors regroup and try to fit a
simpler output variable: a binary indicator of whether the book is
in the top-50 sellers or not. Since they’re modeling a 0/1 outcome, they
use a logistic regression. Like the linear models we used above, we can
also apply regularizers to logistic regression.</p>
<p>In the book, the authors again code up an explicit cross-validation
procedure. The <a href="http://nbviewer.ipython.org/urls/raw.github.com/carljv/Will_it_Python/master/MLFH/ch6/ch6.ipynb">notebook</a> for this chapter has some code that
replicates their procedure, but here I’ll discuss a version that uses
scikit-learn’s <code>GridSearchCV</code> class, which automates the cross-validation
procedure for us. (The term “grid” is a little confusing here, since
we’re only optimizing over one variable, the penalty parameter; the term
“grid” is a little more intuitive in a 2-or-more-dimension search.)</p>
<div class="highlight"><pre><span class="n">clf</span> <span class="o">=</span> <span class="n">GridSearchCV</span><span class="p">(</span><span class="n">LogisticRegression</span><span class="p">(</span><span class="n">C</span> <span class="o">=</span> <span class="mf">1.0</span><span class="p">,</span> <span class="n">penalty</span> <span class="o">=</span> <span class="s">'l1'</span><span class="p">),</span>
<span class="n">c_grid</span><span class="p">,</span>
<span class="n">score_func</span> <span class="o">=</span> <span class="n">metrics</span><span class="o">.</span><span class="n">zero_one_score</span><span class="p">,</span> <span class="n">cv</span> <span class="o">=</span> <span class="n">n_cv_folds</span><span class="p">)</span>
<span class="n">clf</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">trainX</span><span class="p">,</span> <span class="n">trainy</span><span class="p">)</span>
</pre></div>
<p>We initialize the <code>GridSearchCV</code> procedure by telling it:</p>
<ul>
<li>What model we’re using: logistic, with a penalty parameter <code>C</code>, initialized at 1.0, using the L1 (<span class="caps">LASSO</span>) penalty. (In scikit-learn, <code>C</code> is the inverse of the regularization strength, so smaller values mean a stronger penalty.)</li>
<li>A grid/array of parameter value candidates to search over: here values of <code>C</code>.</li>
<li>A score function to optimize: before we were using the <span class="caps">RMSE</span> of the regression, here we’ll use a correct classification rate, given by <code>zero_one_score</code>, in scikit-learn’s <code>metrics</code> module.</li>
<li>The number of cross-validation folds to perform; this is defined elsewhere in the variable <code>n_cv_folds</code>.</li>
</ul>
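<p>To show what those four ingredients amount to, here’s a rough pure-Python sketch of a one-parameter grid search with k-fold cross-validation. Everything in it is hypothetical: <code>k_fold_indices</code>, <code>zero_one</code>, <code>grid_search_cv</code> and the toy nearest-neighbour model <code>knn_fit_predict</code> (standing in for the logistic regression) are names I made up for illustration, and this is not how scikit-learn implements it. It assumes Python 3.</p>
<div class="highlight"><pre>def k_fold_indices(n, n_folds):
    """Return a list of (train, test) index lists, one pair per fold."""
    pairs = []
    for f in range(n_folds):
        test = list(range(f, n, n_folds))   # every n_folds-th case
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        pairs.append((train, test))
    return pairs


def zero_one(y_true, y_pred):
    """Fraction of cases classified correctly."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / float(len(y_true))


def grid_search_cv(fit_predict, param_grid, X, y, n_folds):
    """Score each candidate by its mean fold score; return the best one."""
    mean_scores = {}
    for param in param_grid:
        fold_scores = []
        for train, test in k_fold_indices(len(X), n_folds):
            preds = fit_predict(param,
                                [X[i] for i in train],
                                [y[i] for i in train],
                                [X[i] for i in test])
            fold_scores.append(zero_one([y[i] for i in test], preds))
        mean_scores[param] = sum(fold_scores) / len(fold_scores)
    best = max(param_grid, key=lambda p: mean_scores[p])
    return best, mean_scores[best], mean_scores


def knn_fit_predict(k, X_train, y_train, X_test):
    """Toy stand-in model: majority vote among the k nearest 1-D points."""
    preds = []
    for x in X_test:
        nearest = sorted(zip(X_train, y_train),
                         key=lambda pair: abs(pair[0] - x))[:k]
        ones = sum(label for _, label in nearest)
        preds.append(1 if 2 * ones > len(nearest) else 0)
    return preds


# Made-up data: one feature, a noisy 0/1 outcome.
X = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0]
y = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1]

best_k, best_score, scores = grid_search_cv(knn_fit_predict, [1, 3, 5],
                                            X, y, n_folds=4)
print(best_k, 1.0 - best_score)
</pre></div>
<p>The three return values play roughly the role of <code>best_params_</code>, <code>best_score_</code> and the per-candidate scores on the fitted search object.</p>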
<p>Then I fit the model on training data (a random subset of 80 of the 100 books). After
running this, we can check what value it chose for the penalty
parameter, <code>C</code>, and what the average error rate across the cross-validation folds was for this value.</p>
<div class="highlight"><pre><span class="n">clf</span><span class="o">.</span><span class="n">best_params_</span><span class="p">,</span> <span class="mf">1.0</span> <span class="o">-</span> <span class="n">clf</span><span class="o">.</span><span class="n">best_score_</span>
<span class="p">({</span><span class="s">'C'</span><span class="p">:</span> <span class="mf">0.29377144516536563</span><span class="p">},</span> <span class="mf">0.375</span><span class="p">)</span>
</pre></div>
<p>And again, let’s plot the error rates against values of <code>C</code> to visualize
how regularization affects the model accuracy.</p>
<div class="highlight"><pre><span class="n">rates</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mf">1.0</span> <span class="o">-</span> <span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">clf</span><span class="o">.</span><span class="n">grid_scores_</span><span class="p">])</span>
<span class="n">stds</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="mf">1.0</span> <span class="o">-</span> <span class="n">x</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span> <span class="o">/</span> <span class="n">math</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">n_cv_folds</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span>
<span class="n">clf</span><span class="o">.</span><span class="n">grid_scores_</span><span class="p">]</span>
<span class="n">plt</span><span class="o">.</span><span class="n">fill_between</span><span class="p">(</span><span class="n">cs</span><span class="p">,</span> <span class="n">rates</span> <span class="o">-</span> <span class="n">stds</span><span class="p">,</span> <span class="n">rates</span> <span class="o">+</span> <span class="n">stds</span><span class="p">,</span> <span class="n">color</span> <span class="o">=</span> <span class="s">'steelblue'</span><span class="p">,</span>
<span class="n">alpha</span> <span class="o">=</span> <span class="o">.</span><span class="mi">4</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">cs</span><span class="p">,</span> <span class="n">rates</span><span class="p">,</span> <span class="s">'o-k'</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'Avg. error rate across folds'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'C (regularization parameter)'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Avg. error rate (and +/- 1 s.e.)'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">(</span><span class="n">loc</span> <span class="o">=</span> <span class="s">'best'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">gca</span><span class="p">()</span><span class="o">.</span><span class="n">grid</span><span class="p">()</span>
</pre></div>
<p><a href="../images/logistic_cv_errors.png">
<img src="../images/logistic_cv_errors.png" width = 450px />
</a></p>
<p>After fitting to the training set, we can predict on the test set
and see how accurate the model is on new data using the
<code>classification_report</code> function.</p>
<div class="highlight"><pre>print metrics.classification_report(testy, clf.predict(testX))
             precision    recall  f1-score   support

          0       0.78      0.44      0.56        16
          1       0.18      0.50      0.27         4

avg / total       0.66      0.45      0.50        20
</pre></div>
<p>And the confusion matrix shows we got 9 instances classified correctly
(the diagonal), and 11 incorrectly (the off-diagonal).</p>
<div class="highlight"><pre><span class="k">print</span> <span class="s">' Predicted'</span>
<span class="k">print</span> <span class="s">' Class'</span>
<span class="k">print</span> <span class="n">DataFrame</span><span class="p">(</span><span class="n">metrics</span><span class="o">.</span><span class="n">confusion_matrix</span><span class="p">(</span><span class="n">testy</span><span class="p">,</span> <span class="n">clf</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">testX</span><span class="p">)))</span>
      <span class="n">Predicted</span>
        <span class="n">Class</span>
   <span class="mi">0</span>  <span class="mi">1</span>
<span class="mi">0</span>  <span class="mi">7</span>  <span class="mi">9</span>
<span class="mi">1</span>  <span class="mi">2</span>  <span class="mi">2</span>
</pre></div>
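<p>As a sanity check, the figures in the classification report can be recomputed by hand from the confusion matrix. Here’s a small sketch in plain Python (assuming, as scikit-learn does, that rows are the actual classes and columns the predicted ones); the variable names are my own.</p>
<div class="highlight"><pre># Confusion matrix from the output above: rows = actual, columns = predicted.
confusion = [[7, 9],
             [2, 2]]

n_classes = len(confusion)
total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(n_classes))   # the diagonal
accuracy = correct / float(total)

report = []
for k in range(n_classes):
    predicted_k = sum(confusion[i][k] for i in range(n_classes))  # column sum
    actual_k = sum(confusion[k])                                   # row sum
    precision = confusion[k][k] / float(predicted_k)
    recall = confusion[k][k] / float(actual_k)
    f1 = 2 * precision * recall / (precision + recall)
    report.append((precision, recall, f1, actual_k))
    print("class %d: precision=%.2f recall=%.2f f1=%.2f support=%d"
          % (k, precision, recall, f1, actual_k))

print("accuracy=%.2f (%d of %d)" % (accuracy, correct, total))
</pre></div>
<p>Precision for class 0 is 7/9 and recall is 7/16, which round to the 0.78 and 0.44 in the report; overall, 9 of the 20 test cases are classified correctly.</p>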
<h2>Conclusion</h2>
<p>Cross-validation often requires a lot of bookkeeping code. Writing this
over and over again for different applications is inefficient and
error-prone. So it’s great that scikit-learn has functions that
encapsulate the cross-validation process in convenient
abstractions/interfaces that do the bookkeeping for you. It also has a
wide array of useful, cutting-edge models, and the
<a href="http://scikit-learn.org/stable/">documentation</a> is not just clear and organized, but also
educational: there are lots of examples and exposition that explains how
the underlying models work, not just what the <span class="caps">API</span> is.</p>
<p>So even though we didn’t build any kick-ass, high-accuracy predictive
models here, we did get to explore some fundamental methods in building
<span class="caps">ML</span> models, and get acquainted with the powerful tools in scikit-learn.</p>
Machine Learning for Hackers Chapter 5: Linear regression (with categorical regressors)
2012-12-28T01:32:00-05:00
Carl
tag:slendermeans.org,2012-12-28:ml4h-ch5.html
<h2>Introduction</h2>
<p>Chapter 5 of <em>Machine Learning for Hackers</em> is a relatively simple
exercise in running linear regressions. Therefore, this post will be
short, and I’ll only discuss the more interesting regression example,
which nicely shows how patsy formulas handle categorical variables.</p>
<h2>Linear regression with categorical independent variables</h2>
<p>In chapter 5, the authors construct several linear regressions, the last
of which is a multivariate regression describing the number of page
views of top-viewed web sites. The regression is pretty straightforward,
but includes two categorical variables: <code>HasAdvertising</code>, which takes
values <code>True</code> or <code>False</code>; and <code>InEnglish</code>, which takes values <code>Yes</code>,
<code>No</code> and <code>NA</code> (missing).</p>
<p>If we include these variables in the formula, then patsy/statsmodels will
automatically generate the necessary dummy variables. For
<code>HasAdvertising</code>, we get a dummy variable equal to one when the
value is <code>True</code>. For <code>InEnglish</code>, which takes three values, we get two
separate dummy variables, one for <code>Yes</code>, one for <code>No</code>, with the missing
value serving as the baseline.</p>
<div class="highlight"><pre><span class="n">model</span> <span class="o">=</span> <span class="s">'np.log(PageViews) ~ np.log(UniqueVisitors) + HasAdvertising + InEnglish'</span>
<span class="n">pageview_fit_multi</span> <span class="o">=</span> <span class="n">ols</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">top_1k_sites</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">()</span>
<span class="k">print</span> <span class="n">pageview_fit_multi</span><span class="o">.</span><span class="n">summary</span><span class="p">()</span>
</pre></div>
<p>Results in:</p>
<div class="highlight"><pre>                            OLS Regression Results
==============================================================================
Dep. Variable:      np.log(PageViews)   R-squared:                       0.480
Model:                            OLS   Adj. R-squared:                  0.478
Method:                 Least Squares   F-statistic:                     229.4
Date:                Sat, 24 Nov 2012   Prob (F-statistic):          1.52e-139
Time:                        09:50:25   Log-Likelihood:                -1481.1
No. Observations:                1000   AIC:                             2972.
Df Residuals:                     995   BIC:                             2997.
Df Model:                           4
==========================================================================================
                             coef    std err          t      P&gt;|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------------
Intercept                 -1.9450      1.148     -1.695      0.090        -4.197     0.307
HasAdvertising[T.True]     0.3060      0.092      3.336      0.001         0.126     0.486
InEnglish[T.No]            0.8347      0.209      4.001      0.000         0.425     1.244
InEnglish[T.Yes]          -0.1691      0.204     -0.828      0.408        -0.570     0.232
np.log(UniqueVisitors)     1.2651      0.071     17.936      0.000         1.127     1.403
==============================================================================
Omnibus:                       73.424   Durbin-Watson:                   2.068
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               92.632
Skew:                           0.646   Prob(JB):                     7.68e-21
Kurtosis:                       3.744   Cond. No.                         570.
==============================================================================
</pre></div>
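<p>The <code>[T.No]</code> and <code>[T.Yes]</code> rows in the summary are indicator columns measured against the omitted baseline level. As a rough illustration of that coding (a hypothetical helper, <code>treatment_dummies</code>, on made-up rows; this is not patsy’s implementation), in plain Python:</p>
<div class="highlight"><pre>def treatment_dummies(values, baseline):
    """Indicator columns for every level except the baseline level."""
    levels = sorted(set(values) - set([baseline]))
    return dict((level, [1 if v == level else 0 for v in values])
                for level in levels)

# Made-up rows, not the real top-1000 sites data.
in_english = ["Yes", "No", "NA", "Yes", "NA"]

dummies = treatment_dummies(in_english, baseline="NA")
for level in sorted(dummies):
    print("InEnglish[T.%s]: %s" % (level, dummies[level]))
</pre></div>
<p>Rows at the baseline level get a zero in every indicator column, so each coefficient is read as a shift relative to that baseline.</p>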
<p>If we were going to do this without the formula <span class="caps">API</span>, we’d have to
explicitly make these dummies. For comparison, here’s how that looks.</p>
<div class="highlight"><pre><span class="n">top_1k_sites</span><span class="p">[</span><span class="s">'LogUniqueVisitors'</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="n">top_1k_sites</span><span class="p">[</span><span class="s">'UniqueVisitors'</span><span class="p">])</span>
<span class="n">top_1k_sites</span><span class="p">[</span><span class="s">'HasAdvertisingYes'</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">top_1k_sites</span><span class="p">[</span><span class="s">'HasAdvertising'</span><span class="p">]</span> <span class="o">==</span> <span class="s">'Yes'</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">top_1k_sites</span><span class="p">[</span><span class="s">'InEnglishYes'</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">top_1k_sites</span><span class="p">[</span><span class="s">'InEnglish'</span><span class="p">]</span> <span class="o">==</span> <span class="s">'Yes'</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">top_1k_sites</span><span class="p">[</span><span class="s">'InEnglishNo'</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">top_1k_sites</span><span class="p">[</span><span class="s">'InEnglish'</span><span class="p">]</span> <span class="o">==</span> <span class="s">'No'</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">linreg_fit</span> <span class="o">=</span> <span class="n">sm</span><span class="o">.</span><span class="n">OLS</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="n">top_1k_sites</span><span class="p">[</span><span class="s">'PageViews'</span><span class="p">]),</span>
<span class="n">sm</span><span class="o">.</span><span class="n">add_constant</span><span class="p">(</span><span class="n">top_1k_sites</span><span class="p">[[</span><span class="s">'HasAdvertisingYes'</span><span class="p">,</span>
<span class="s">'LogUniqueVisitors'</span><span class="p">,</span>
<span class="s">'InEnglishNo'</span><span class="p">,</span> <span class="s">'InEnglishYes'</span><span class="p">]],</span>
<span class="n">prepend</span> <span class="o">=</span> <span class="bp">True</span><span class="p">))</span><span class="o">.</span><span class="n">fit</span><span class="p">()</span>
<span class="n">linreg_fit</span><span class="o">.</span><span class="n">summary</span><span class="p">()</span>
</pre></div>
Machine Learning for Hackers Chapter 4: Priority e-mail ranking
2012-12-28T00:00:00-05:00
Carl
tag:slendermeans.org,2012-12-28:ml4h-ch4.html
<h2>Introduction</h2>
<p>I’m not going to write much about this chapter. In my opinion the payoff-to-effort ratio for this project is pretty low. The algorithm for ranking e-mails is pretty straightforward, but in my opinion seriously flawed. Most of the code in the chapter (and there’s a lot of it) revolves around parsing the text in the files. It’s a good exercise in thinking through feature extraction, but it’s not got a lot of new <span class="caps">ML</span> concepts. And from my perspective, there’s not much opportunity to show off any Python goodness. But, I’ll hit a couple of points that are new and interesting.</p>
<p>The complete code is at the Github repo <a href="https://github.com/carljv/Will_it_Python/tree/master/MLFH/ch4">here</a>, and you can read the notebook via nbviewer <a href="http://nbviewer.ipython.org/urls/raw.github.com/carljv/Will_it_Python/master/MLFH/ch4/ch4.ipynb">here</a>.</p>
<p><strong>1. Vectorized string methods in pandas.</strong> Back in <a href="../ml4h-ch1-p2.html">Chapter 1</a>, I groused about lacking vectorized functions for operations on strings or dates in pandas. If it wasn’t a numpy ufunc, you had to use the pandas <code>map()</code> method. That’s changed a lot over the summer, and since pandas 0.9.0, we can call <a href="http://pandas.pydata.org/pandas-docs/stable/basics.html#vectorized-string-methods">vectorized string methods</a>.</p>
<p>For example, here’s the code from my version of the chapter’s program that identifies e-mails that are part of a thread, by looking for “re:”-like prefixes on the subjects.</p>
<div class="highlight"><pre><span class="n">reply_pattern</span> <span class="o">=</span> <span class="s">'(re:|re\[\d\]:)'</span>
<span class="n">fwd_pattern</span> <span class="o">=</span> <span class="s">'(fw:|fw[\d]:)'</span>

<span class="k">def</span> <span class="nf">thread_flag</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
    <span class="sd">'''</span>
<span class="sd">    Returns True if string s matches the thread patterns.</span>
<span class="sd">    If s is a pandas Series, returns a Series of booleans.</span>
<span class="sd">    '''</span>
    <span class="c"># Pass re.I by keyword so it is always treated as a flag.</span>
    <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="nb">basestring</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">re</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">reply_pattern</span><span class="p">,</span> <span class="n">s</span><span class="p">,</span> <span class="n">flags</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">I</span><span class="p">)</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">s</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="n">reply_pattern</span><span class="p">,</span> <span class="n">flags</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">I</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">clean_subject</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
    <span class="sd">'''</span>
<span class="sd">    Removes all the reply and forward labeling from a</span>
<span class="sd">    string (an e-mail subject) s.</span>
<span class="sd">    If s is a pandas Series, returns a Series of cleaned</span>
<span class="sd">    strings.</span>
<span class="sd">    This will help find the initial message in the thread</span>
<span class="sd">    (which won't have any of the reply/forward labeling).</span>
<span class="sd">    '''</span>
    <span class="c"># The fourth positional argument of re.sub() is count, not flags,</span>
    <span class="c"># so the flag has to be given by keyword.</span>
    <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="nb">basestring</span><span class="p">):</span>
        <span class="n">s_clean</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">sub</span><span class="p">(</span><span class="n">reply_pattern</span><span class="p">,</span> <span class="s">''</span><span class="p">,</span> <span class="n">s</span><span class="p">,</span> <span class="n">flags</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">I</span><span class="p">)</span>
        <span class="n">s_clean</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">sub</span><span class="p">(</span><span class="n">fwd_pattern</span><span class="p">,</span> <span class="s">''</span><span class="p">,</span> <span class="n">s_clean</span><span class="p">,</span> <span class="n">flags</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">I</span><span class="p">)</span>
        <span class="n">s_clean</span> <span class="o">=</span> <span class="n">s_clean</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">s_clean</span> <span class="o">=</span> <span class="n">s</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="n">reply_pattern</span><span class="p">,</span> <span class="s">''</span><span class="p">,</span> <span class="n">flags</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">I</span><span class="p">)</span>
        <span class="n">s_clean</span> <span class="o">=</span> <span class="n">s_clean</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="n">fwd_pattern</span><span class="p">,</span> <span class="s">''</span><span class="p">,</span> <span class="n">flags</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">I</span><span class="p">)</span>
        <span class="n">s_clean</span> <span class="o">=</span> <span class="n">s_clean</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>
    <span class="k">return</span> <span class="n">s_clean</span>
</pre></div>
<p>In <code>thread_flag</code>, if the input is a pandas series of e-mail subject lines, then the function will use a vectorized string function, called with <code>.str.contains()</code> to see if a pattern matching a reply-type prefix is in the subject. The function will therefore return a pandas series of booleans, that are <code>True</code> for all the subjects that have a reply pattern, and <code>False</code> for all the subjects that don’t.</p>
<p>The function <code>clean_subject</code>, if given a pandas Series input, will use the vectorized string methods <code>.str.replace()</code> and <code>.str.strip()</code> to clean the re- and fwd-like patterns out of the subjects.</p>
<p>Notice there are some differences between the naming of pandas string methods and the base string methods or <code>re</code> module functions that perform similar operations on single strings. For example, there’s no <code>contains</code> function in <code>re</code>; we use <code>re.search()</code>. Similarly <code>.str.replace()</code> does what we’d use <code>re.sub()</code> to do on a single string.</p>
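<p>One detail of the single-string <code>re</code> functions is worth a small illustration: <code>re.search()</code> takes flags as its third parameter, but in <code>re.sub()</code> the fourth positional parameter is <code>count</code>, so a flag like <code>re.I</code> only works there when passed by keyword. A minimal sketch, using the reply pattern from above on a made-up subject line:</p>
<div class="highlight"><pre>import re

reply_pattern = r'(re:|re\[\d\]:)'   # same pattern as above, as a raw string

subject = 'Re: RE[2]: lunch on friday?'   # hypothetical subject line

# Case-insensitive search for a reply-style prefix.
is_reply = re.search(reply_pattern, subject, flags=re.I) is not None

# Strip every reply-style prefix; flags must be given by keyword here,
# because the fourth positional argument of re.sub() is count.
cleaned = re.sub(reply_pattern, '', subject, flags=re.I).strip()

print(is_reply)
print(cleaned)
</pre></div>
<p>Both prefixes are removed regardless of case, leaving just the underlying subject text.</p>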
<p><strong>2. More term-document matrices.</strong> In <a href="../ml4h-ch3.html">Chapter 3</a> we built a term-document matrix to extract term-frequency features from a set of e-mails. This chapter has a similar exercise, applied to both e-mail messages and their subjects. In the code for that chapter, I built a <span class="caps">TDM</span> function that wrapped the term-document matrix function in the <code>textmining</code> package, adding some options that tried to mimic the <code>tdm</code> function in R’s <code>tm</code> package. I use that same function, <code>tdm_df</code>, here. In the post for that chapter, I lamented that I couldn’t find a decent term-document matrix function for Python. The one in <code>textmining</code> was too barebones and I was surprised there was nothing that fit the bill in <span class="caps">NLTK</span>.</p>
<p>In comments to that post, Vishal Goklani pointed me to the <a href="http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer"><code>CountVectorizer</code></a> class in scikit-learn (in the <code>sklearn.feature_extraction.text</code> module). Despite the rather generic name, this will give you a <span class="caps">TDM</span> from a set of documents, returned in the form of a sparse matrix. Here’s a quick-and-dirty wrapper function that returns a <span class="caps">TDM</span> in the form of a pandas DataFrame.</p>
<div class="highlight"><pre><span class="k">def</span> <span class="nf">sklearn_tdm_df</span><span class="p">(</span><span class="n">docs</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
    <span class="sd">'''</span>
<span class="sd">    Create a term-document matrix (TDM) in the form of a pandas DataFrame</span>
<span class="sd">    Uses sklearn's CountVectorizer function.</span>

<span class="sd">    Parameters</span>
<span class="sd">    ----------</span>
<span class="sd">    docs: a sequence of documents (files, filenames, or the content) to be</span>
<span class="sd">        included in the TDM. See the `input` argument to CountVectorizer.</span>
<span class="sd">    **kwargs: keyword arguments for CountVectorizer options.</span>

<span class="sd">    Returns</span>
<span class="sd">    -------</span>
<span class="sd">    tdm_df: A pandas DataFrame with the term-document matrix. Columns are terms,</span>
<span class="sd">        rows are documents.</span>
<span class="sd">    '''</span>
    <span class="c"># Initialize the vectorizer and get term counts in each document.</span>
    <span class="n">vectorizer</span> <span class="o">=</span> <span class="n">CountVectorizer</span><span class="p">(</span><span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
    <span class="n">word_counts</span> <span class="o">=</span> <span class="n">vectorizer</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">docs</span><span class="p">)</span>

    <span class="c"># .vocabulary_ is a dict whose keys are the terms in the documents,</span>
    <span class="c"># and whose values are the column indices in the matrix returned by fit_transform()</span>
    <span class="n">vocab</span> <span class="o">=</span> <span class="n">vectorizer</span><span class="o">.</span><span class="n">vocabulary_</span>

    <span class="c"># Make a dictionary of Series for each term; convert to DataFrame.</span>
    <span class="c"># Each sparse column is densified first so that every document (row) gets</span>
    <span class="c"># a count, zeros included; .data alone would hold only the non-zero entries.</span>
    <span class="n">count_dict</span> <span class="o">=</span> <span class="p">{</span><span class="n">w</span><span class="p">:</span> <span class="n">Series</span><span class="p">(</span><span class="n">word_counts</span><span class="o">.</span><span class="n">getcol</span><span class="p">(</span><span class="n">vocab</span><span class="p">[</span><span class="n">w</span><span class="p">])</span><span class="o">.</span><span class="n">toarray</span><span class="p">()</span><span class="o">.</span><span class="n">ravel</span><span class="p">())</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">vocab</span><span class="p">}</span>
    <span class="n">tdm_df</span> <span class="o">=</span> <span class="n">DataFrame</span><span class="p">(</span><span class="n">count_dict</span><span class="p">)</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">tdm_df</span>

<span class="c"># Call the function on e-mail messages. The token_pattern is set so that terms are only</span>
<span class="c"># words with two or more letters (no numbers or punctuation)</span>
<span class="n">message_tdm</span> <span class="o">=</span> <span class="n">sklearn_tdm_df</span><span class="p">(</span><span class="n">train_df</span><span class="p">[</span><span class="s">'message'</span><span class="p">],</span>
                             <span class="n">stop_words</span> <span class="o">=</span> <span class="s">'english'</span><span class="p">,</span>
                             <span class="n">charset_error</span> <span class="o">=</span> <span class="s">'ignore'</span><span class="p">,</span>
                             <span class="n">token_pattern</span> <span class="o">=</span> <span class="s">'[a-zA-Z]{2,}'</span><span class="p">)</span>
</pre></div>
<p><strong>3. Timezone issues and rank instability.</strong> In the book, the authors compute stats measuring how active threads are. This depends on the time-stamps of the messages, which the authors parse out of the e-mail files. They ignore the time-zone information in the time-stamps, and this seems to create some bugs. For example, the following thread has two e-mails:</p>
<div class="highlight"><pre>Name: [sadev] [bug 840] spam_level_char option change/removal
734 2002-09-06 10:56:23-07:00
763 2002-09-06 13:56:19-04:00
</pre></div>
<p>If you ignore the timezones, it looks like 763 comes three hours after 734. But looking at the timezones, you can see that 734 actually comes <em>four seconds after</em> 763. So this is a far more active thread than the code in the book calculates.</p>
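<p>As a rough check (this is not the book’s code or the notebook’s), the gap between those two messages can be computed with Python’s standard <code>datetime</code> module, once ignoring the offsets and once using them. The variable names are mine.</p>

```python
from datetime import datetime

# Timestamps copied from the thread listing above.
msg_734 = datetime.fromisoformat('2002-09-06 10:56:23-07:00')
msg_763 = datetime.fromisoformat('2002-09-06 13:56:19-04:00')

# Dropping the offsets reproduces the misleading gap of about three hours.
naive_gap = msg_763.replace(tzinfo=None) - msg_734.replace(tzinfo=None)

# Keeping the offsets compares the two instants on a common clock.
aware_gap = msg_734 - msg_763

print('ignoring timezones:', naive_gap)
print('using timezones:   ', aware_gap)
```

<p>Subtracting two timezone-aware datetimes accounts for their offsets, so no manual conversion to <span class="caps">UTC</span> is needed.</p>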
<p>This sort of issue has a pretty big effect on the ranks of the messages. The rank is just the product of 5 feature weights (based on sender info, thread activity, and term features). Even though the authors scale the individual feature weights (typically with log-scales), calculating the final rank as a product means you can get big rank differences from what might seem to be practically similar features (even without any bugs). For example, in some cases it doesn’t take a big difference to double a feature’s weight, which then doubles the e-mail’s rank. So it seems to me the ranking procedure in the book is not very stable. This is fine, since it’s just meant to be illustrative, but of course you want to be aware of this issue for a more serious exercise.</p>
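<p>To make the instability concrete, here is a toy sketch with made-up weights (the five values below are hypothetical, not taken from the book’s data). Because the rank is a product, doubling any single weight doubles the rank, no matter how similar the other four features are.</p>

```python
from math import prod

def rank(weights):
    """Rank as the plain product of the feature weights."""
    return prod(weights)

# Hypothetical weights for two nearly identical e-mails.
email_a = [1.2, 0.8, 1.5, 1.0, 2.0]
email_b = [1.2, 0.8, 1.5, 2.0, 2.0]   # only the fourth weight differs

ratio = rank(email_b) / rank(email_a)
print(rank(email_a), rank(email_b), ratio)
```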
<h2>Conclusion</h2>
<p>I didn’t go into much detail here. If you’re interested in seeing a lot of Python and pandas text parsing in action, definitely check out the <a href="http://nbviewer.ipython.org/urls/raw.github.com/carljv/Will_it_Python/master/MLFH/ch4/ch4.ipynb">code</a>.</p>ARM Chapter 5: Logistic models of well-switching in Bangladesh2012-12-22T19:10:00-05:00Carltag:slendermeans.org,2012-12-22:arm-ch5.html<p>The logistic regression we ran for <a href="../ml4h-ch2-p2.html">chapter 2 of <em>Machine Learning for
Hackers</em></a> was pretty simple. So I wanted to find an example that would
dig a little deeper into statsmodels’s capabilities and the power of the
patsy formula language.</p>
<p>So, I’m taking an intermission from <em>Machine Learning for Hackers</em> and
am going to show an example from Gelman and Hill’s <a href="http://www.stat.columbia.edu/~gelman/arm/"><em>Data Analysis Using
Regression and Multilevel/Hierarchical Models</em></a> <em>(“<span class="caps">ARM</span>”)</em>. The chapter
has a great example of going through the process of building,
interpreting, and diagnosing a logistic regression model. We’ll end up
with a model with lots of interactions and variable transforms, which is
a great showcase for patsy and the statsmodels formula <span class="caps">API</span>.</p>
<h2>Logistic model of well-switching in Bangladesh</h2>
<p>Our data are information on about 3,000 respondent households in
Bangladesh with wells having an unsafe amount of arsenic. The data
record the amount of arsenic in the respondent’s well, the distance to
the nearest safe well (in meters), whether that respondent “switched”
wells by using a neighbor’s safe well instead of their own, as well as
the respondent’s years of education and a dummy variable indicating
whether they belong to a community association.</p>
<div class="highlight"><pre> switch arsenic dist assoc educ
1 1 2.36 16.826000 0 0
2 1 0.71 47.321999 0 0
3 0 2.07 20.966999 0 10
4 1 1.15 21.486000 0 12
5 1 1.10 40.874001 1 14
...
</pre></div>
<p>Our goal is to model the well-switching decision. Since it’s a binary
variable (1 = switch, 0 = no switch), we’ll use logistic regression.</p>
<p>The IPython notebook is at the GitHub repo <a href="https://github.com/carljv/Will_it_Python/tree/master/ARM/ch5">here</a>, and you can go
<a href="http://nbviewer.ipython.org/urls/raw.github.com/carljv/Will_it_Python/master/ARM/ch5/arsenic_wells_switching.ipynb">here</a> to view it on nbviewer. The analysis follows <em><span class="caps">ARM</span></em> chapter 5.4.</p>
<h2>Model 1: Distance to a safe well</h2>
<p>For our first pass, we’ll just use the distance to the nearest safe
well. Since the distance is recorded in meters, and the effect of one
meter is likely to be very small, we can get nicer model coefficients if
we scale it. Instead of creating a new scaled variable, we’ll just do it
in the formula description using the <code>I()</code> function.</p>
<div class="highlight"><pre><span class="n">model1</span> <span class="o">=</span> <span class="n">logit</span><span class="p">(</span><span class="s">'switch ~ I(dist/100.)'</span><span class="p">,</span> <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">()</span>
<span class="k">print</span> <span class="n">model1</span><span class="o">.</span><span class="n">summary</span><span class="p">()</span>
</pre></div>
<div class="highlight"><pre>Optimization terminated successfully.
Current function value: 2038.118913
Iterations 4
Logit Regression Results
</pre></div>
<div class="highlight"><pre>==============================================================================
Dep. Variable: switch No. Observations: 3020
Model: Logit Df Residuals: 3018
Method: MLE Df Model: 1
Date: Sat, 22 Dec 2012 Pseudo R-squ.: 0.01017
Time: 13:05:25 Log-Likelihood: -2038.1
converged: True LL-Null: -2059.0
LLR p-value: 9.798e-11
==================================================================================
coef std err z P>|z| [95.0% Conf. Int.]
----------------------------------------------------------------------------------
Intercept 0.6060 0.060 10.047 0.000 0.488 0.724
I(dist / 100.) -0.6219 0.097 -6.383 0.000 -0.813 -0.431
</pre></div>
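<p>A quick way to read these coefficients is to push them through the inverse-logit by hand. The sketch below uses only the two estimates from the table above; the helper names are mine, and the distances are arbitrary illustration points. It also shows that rescaling distance inside <code>I()</code> changes only the units of the coefficient, not the predictions.</p>

```python
from math import exp

def inv_logit(z):
    """Map a value on the log-odds scale to a probability."""
    return 1.0 / (1.0 + exp(-z))

# Estimates copied from the Model 1 summary.
INTERCEPT = 0.6060
COEF_PER_100M = -0.6219

def p_switch(dist_m):
    """Predicted probability of switching at a distance given in meters."""
    return inv_logit(INTERCEPT + COEF_PER_100M * dist_m / 100.0)

for d in (0, 50, 100, 200):
    print(d, round(p_switch(d), 3))

# Same model with distance left in meters: the slope is 100 times smaller,
# but the prediction at 100 m is unchanged.
coef_per_m = COEF_PER_100M / 100.0
p_unscaled = inv_logit(INTERCEPT + coef_per_m * 100)
print(round(p_unscaled, 3))
```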
<p>Let’s plot this model. We’ll want to jitter the <code>switch</code> data, since
it’s all 0/1 and will over-plot.</p>
<p><a href="../images/switch_dist_jittter.png">
<img src="../images/switch_dist_jittter.png" width=400px />
</a></p>
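<p>Jittering just means adding a little random noise so the 0/1 points stop landing on top of each other. A minimal sketch with the standard library follows; the function name and the 0.05 spread are my own choices, not necessarily what the notebook uses.</p>

```python
import random

def jitter(values, amount=0.05, seed=None):
    """Return the values with uniform noise in [-amount, amount] added."""
    rng = random.Random(seed)
    return [v + rng.uniform(-amount, amount) for v in values]

switch = [1, 1, 0, 1, 1]            # the first rows of the data shown earlier
jittered = jitter(switch, seed=42)
print(jittered)
```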
<p>Another way to look at this is to plot the densities of distance for
switchers and non-switchers. We expect the distribution of switchers to
have more mass over short distances and the distribution of
non-switchers to have more mass over long distances.</p>
<p><a href="../images/switch_dist_kde.png">
<img src="../images/switch_dist_kde.png" width=400px />
</a></p>
<h2>Model 2: Distance to a safe well and the arsenic level of own well</h2>
<p>Next, let’s add the arsenic level as a regressor. We’d expect
respondents with higher arsenic levels to be more motivated to switch.</p>
<div class="highlight"><pre><span class="n">model2</span> <span class="o">=</span> <span class="n">logit</span><span class="p">(</span><span class="s">'switch ~ I(dist / 100.) + arsenic'</span><span class="p">,</span> <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">()</span>
<span class="k">print</span> <span class="n">model2</span><span class="o">.</span><span class="n">summary</span><span class="p">()</span>
<span class="n">Optimization</span> <span class="n">terminated</span> <span class="n">successfully</span><span class="o">.</span>
<span class="n">Current</span> <span class="n">function</span> <span class="n">value</span><span class="p">:</span> <span class="mf">1965.334134</span>
<span class="n">Iterations</span> <span class="mi">5</span>
<span class="n">Logit</span> <span class="n">Regression</span> <span class="n">Results</span>
<span class="o">==============================================================================</span>
<span class="n">Dep</span><span class="o">.</span> <span class="n">Variable</span><span class="p">:</span> <span class="n">switch</span> <span class="n">No</span><span class="o">.</span> <span class="n">Observations</span><span class="p">:</span> <span class="mi">3020</span>
<span class="n">Model</span><span class="p">:</span> <span class="n">Logit</span> <span class="n">Df</span> <span class="n">Residuals</span><span class="p">:</span> <span class="mi">3017</span>
<span class="n">Method</span><span class="p">:</span> <span class="n">MLE</span> <span class="n">Df</span> <span class="n">Model</span><span class="p">:</span> <span class="mi">2</span>
<span class="n">Date</span><span class="p">:</span> <span class="n">Sat</span><span class="p">,</span> <span class="mi">22</span> <span class="n">Dec</span> <span class="mi">2012</span> <span class="n">Pseudo</span> <span class="n">R</span><span class="o">-</span><span class="n">squ</span><span class="o">.</span><span class="p">:</span> <span class="mf">0.04551</span>
<span class="n">Time</span><span class="p">:</span> <span class="mi">13</span><span class="p">:</span><span class="mo">05</span><span class="p">:</span><span class="mi">29</span> <span class="n">Log</span><span class="o">-</span><span class="n">Likelihood</span><span class="p">:</span> <span class="o">-</span><span class="mf">1965.3</span>
<span class="n">converged</span><span class="p">:</span> <span class="bp">True</span> <span class="n">LL</span><span class="o">-</span><span class="n">Null</span><span class="p">:</span> <span class="o">-</span><span class="mf">2059.0</span>
<span class="n">LLR</span> <span class="n">p</span><span class="o">-</span><span class="n">value</span><span class="p">:</span> <span class="mf">1.995e-41</span>
<span class="o">==================================================================================</span>
<span class="n">coef</span> <span class="n">std</span> <span class="n">err</span> <span class="n">z</span> <span class="n">P</span><span class="o">>|</span><span class="n">z</span><span class="o">|</span> <span class="p">[</span><span class="mf">95.0</span><span class="o">%</span> <span class="n">Conf</span><span class="o">.</span> <span class="n">Int</span><span class="o">.</span><span class="p">]</span>
<span class="o">----------------------------------------------------------------------------------</span>
<span class="n">Intercept</span> <span class="mf">0.0027</span> <span class="mf">0.079</span> <span class="mf">0.035</span> <span class="mf">0.972</span> <span class="o">-</span><span class="mf">0.153</span> <span class="mf">0.158</span>
<span class="n">I</span><span class="p">(</span><span class="n">dist</span> <span class="o">/</span> <span class="mf">100.</span><span class="p">)</span> <span class="o">-</span><span class="mf">0.8966</span> <span class="mf">0.104</span> <span class="o">-</span><span class="mf">8.593</span> <span class="mf">0.000</span> <span class="o">-</span><span class="mf">1.101</span> <span class="o">-</span><span class="mf">0.692</span>
<span class="n">arsenic</span> <span class="mf">0.4608</span> <span class="mf">0.041</span> <span class="mf">11.134</span> <span class="mf">0.000</span> <span class="mf">0.380</span> <span class="mf">0.542</span>
<span class="o">==================================================================================</span>
</pre></div>
<p>That is what we see. The coefficients are what we’d expect: the farther
to a safe well, the less likely a respondent is to switch, but the
higher the arsenic level in their own well, the more likely.</p>
<h3>Marginal Effects</h3>
<p>To see the effect of these on the probability of switching, let’s
calculate the marginal effects at the mean of the data.</p>
<div class="highlight"><pre><span class="n">model2</span><span class="o">.</span><span class="n">margeff</span><span class="p">(</span><span class="n">at</span> <span class="o">=</span> <span class="s">'mean'</span><span class="p">)</span>
<span class="n">array</span><span class="p">([</span><span class="o">-</span><span class="mf">0.21806505</span><span class="p">,</span> <span class="mf">0.11206108</span><span class="p">])</span>
</pre></div>
<p>So, for the mean respondent, an increase of 100 meters to the nearest
safe well is associated with a 22 percentage point lower probability of
switching. But an increase of 1 in the arsenic level is associated with an
11 percentage point higher probability of switching.</p>
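<p>For a logit model, the marginal effect of a regressor at a given point is its coefficient times p(1 - p), where p is the predicted probability at that point. The sketch below applies that formula to the Model 2 estimates. The mean values used (about 48 meters and an arsenic level of 1.66) are assumptions for illustration; they are not stated in this post, so treat the output as approximate.</p>

```python
from math import exp

def inv_logit(z):
    """Map a value on the log-odds scale to a probability."""
    return 1.0 / (1.0 + exp(-z))

# Estimates copied from the Model 2 summary.
coefs = {'Intercept': 0.0027, 'dist100': -0.8966, 'arsenic': 0.4608}

# ASSUMED sample means (not given in the post): 48 m and 1.66.
means = {'dist100': 0.48, 'arsenic': 1.66}

z = coefs['Intercept'] + sum(coefs[k] * means[k] for k in means)
p = inv_logit(z)

# d p / d x_k = beta_k * p * (1 - p)
marginal = {k: coefs[k] * p * (1 - p) for k in means}
print(round(p, 3), marginal)
```

<p>With those assumed means, the two values come out close to the reported marginal effects.</p>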
<h3>Class separability</h3>
<p>To get a sense of how well this model might classify switchers and
non-switchers, we can plot each class of respondent in
(distance-arsenic)-space.
We don’t see very clean separation, so we’d expect the model to have a
fairly high error rate. But we do notice that the
short-distance/high-arsenic region of the graph is mostly made up of
switchers, and the long-distance/low-arsenic region is mostly made up
of non-switchers.</p>
<p><a href="../images/dist_arsenic_sep.png">
<img src="../images/dist_arsenic_sep.png" width=400px />
</a></p>
<h2>Model 3: Adding an interaction</h2>
<p>It’s sensible that distance and arsenic would interact in the model. In
other words, the effect of an additional 100 meters on your decision to switch
would be affected by how much arsenic is in your well.</p>
<p>Again, we don’t have to pre-compute an explicit interaction variable. We
can just specify an interaction in the formula description using the <code>:</code>
operator.</p>
<div class="highlight"><pre><span class="n">model3</span> <span class="o">=</span> <span class="n">logit</span><span class="p">(</span><span class="s">'switch ~ I(dist / 100.) + arsenic + I(dist / 100.):arsenic'</span><span class="p">,</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">()</span>
<span class="k">print</span> <span class="n">model3</span><span class="o">.</span><span class="n">summary</span><span class="p">()</span>
<span class="n">Optimization</span> <span class="n">terminated</span> <span class="n">successfully</span><span class="o">.</span>
<span class="n">Current</span> <span class="n">function</span> <span class="n">value</span><span class="p">:</span> <span class="mf">1963.814202</span>
<span class="n">Iterations</span> <span class="mi">5</span>
<span class="n">Logit</span> <span class="n">Regression</span> <span class="n">Results</span>
<span class="o">==============================================================================</span>
<span class="n">Dep</span><span class="o">.</span> <span class="n">Variable</span><span class="p">:</span> <span class="n">switch</span> <span class="n">No</span><span class="o">.</span> <span class="n">Observations</span><span class="p">:</span> <span class="mi">3020</span>
<span class="n">Model</span><span class="p">:</span> <span class="n">Logit</span> <span class="n">Df</span> <span class="n">Residuals</span><span class="p">:</span> <span class="mi">3016</span>
<span class="n">Method</span><span class="p">:</span> <span class="n">MLE</span> <span class="n">Df</span> <span class="n">Model</span><span class="p">:</span> <span class="mi">3</span>
<span class="n">Date</span><span class="p">:</span> <span class="n">Sat</span><span class="p">,</span> <span class="mi">22</span> <span class="n">Dec</span> <span class="mi">2012</span> <span class="n">Pseudo</span> <span class="n">R</span><span class="o">-</span><span class="n">squ</span><span class="o">.</span><span class="p">:</span> <span class="mf">0.04625</span>
<span class="n">Time</span><span class="p">:</span> <span class="mi">13</span><span class="p">:</span><span class="mo">05</span><span class="p">:</span><span class="mi">33</span> <span class="n">Log</span><span class="o">-</span><span class="n">Likelihood</span><span class="p">:</span> <span class="o">-</span><span class="mf">1963.8</span>
<span class="n">converged</span><span class="p">:</span> <span class="bp">True</span> <span class="n">LL</span><span class="o">-</span><span class="n">Null</span><span class="p">:</span> <span class="o">-</span><span class="mf">2059.0</span>
<span class="n">LLR</span> <span class="n">p</span><span class="o">-</span><span class="n">value</span><span class="p">:</span> <span class="mf">4.830e-41</span>
<span class="o">==========================================================================================</span>
<span class="n">coef</span> <span class="n">std</span> <span class="n">err</span> <span class="n">z</span> <span class="n">P</span><span class="o">>|</span><span class="n">z</span><span class="o">|</span> <span class="p">[</span><span class="mf">95.0</span><span class="o">%</span> <span class="n">Conf</span><span class="o">.</span> <span class="n">Int</span><span class="o">.</span><span class="p">]</span>
<span class="o">------------------------------------------------------------------------------------------</span>
<span class="n">Intercept</span> <span class="o">-</span><span class="mf">0.1479</span> <span class="mf">0.118</span> <span class="o">-</span><span class="mf">1.258</span> <span class="mf">0.208</span> <span class="o">-</span><span class="mf">0.378</span> <span class="mf">0.083</span>
<span class="n">I</span><span class="p">(</span><span class="n">dist</span> <span class="o">/</span> <span class="mf">100.</span><span class="p">)</span> <span class="o">-</span><span class="mf">0.5772</span> <span class="mf">0.209</span> <span class="o">-</span><span class="mf">2.759</span> <span class="mf">0.006</span> <span class="o">-</span><span class="mf">0.987</span> <span class="o">-</span><span class="mf">0.167</span>
<span class="n">arsenic</span> <span class="mf">0.5560</span> <span class="mf">0.069</span> <span class="mf">8.021</span> <span class="mf">0.000</span> <span class="mf">0.420</span> <span class="mf">0.692</span>
<span class="n">I</span><span class="p">(</span><span class="n">dist</span> <span class="o">/</span> <span class="mf">100.</span><span class="p">):</span><span class="n">arsenic</span> <span class="o">-</span><span class="mf">0.1789</span> <span class="mf">0.102</span> <span class="o">-</span><span class="mf">1.748</span> <span class="mf">0.080</span> <span class="o">-</span><span class="mf">0.379</span> <span class="mf">0.022</span>
<span class="o">==========================================================================================</span>
</pre></div>
<p>The coefficient on the interaction is negative, though only marginally
significant (p = 0.08). While we can’t directly interpret its quantitative
effect on switching, the qualitative interpretation can be read two ways
on the log-odds scale. Distance has a negative effect on switching, and
this negative effect becomes stronger when arsenic levels are high.
Alternatively, the arsenic level has a positive effect on switching, but
this positive effect is reduced as distance to the nearest safe well
increases.</p>
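<p>One way to read the interaction, sketched here with the Model 3 estimates only: on the log-odds scale the slope for distance is no longer a single number but a line in arsenic, and likewise for arsenic. The function names and the arsenic and distance values plugged in are illustrative choices of mine.</p>

```python
# Estimates copied from the Model 3 summary.
B_DIST = -0.5772        # per 100 m
B_ARSENIC = 0.5560
B_INTERACTION = -0.1789

def dist_slope(arsenic):
    """Log-odds change per extra 100 m, at a given arsenic level."""
    return B_DIST + B_INTERACTION * arsenic

def arsenic_slope(dist_m):
    """Log-odds change per unit of arsenic, at a given distance in meters."""
    return B_ARSENIC + B_INTERACTION * dist_m / 100.0

for a in (0.5, 1.0, 2.0):
    print('arsenic', a, 'dist slope', round(dist_slope(a), 3))
for d in (0, 50, 100):
    print('dist', d, 'arsenic slope', round(arsenic_slope(d), 3))
```

<p>The distance slope grows more negative as arsenic rises, while the arsenic slope shrinks toward zero as distance rises.</p>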
<h2>Model 4: Adding education, more interactions, and centering variables</h2>
<p>Respondents with more education might have a better understanding of the
harmful effects of arsenic and therefore may be more likely to switch.
Education is in years, so we’ll scale it for more sensible coefficients.
We’ll also include interactions amongst all the regressors.</p>
<p>We’re also going to center the variables, to help with interpretation of
the coefficients. Once more, we can just do this in the formula, without
pre-computing centered variables.</p>
<div class="highlight"><pre><span class="n">model_form</span> <span class="o">=</span> <span class="p">(</span><span class="s">'switch ~ center(I(dist / 100.)) + center(arsenic) + '</span> <span class="o">+</span>
<span class="s">'center(I(educ / 4.)) + '</span> <span class="o">+</span>
<span class="s">'center(I(dist / 100.)) : center(arsenic) + '</span> <span class="o">+</span>
<span class="s">'center(I(dist / 100.)) : center(I(educ / 4.)) + '</span> <span class="o">+</span>
<span class="s">'center(arsenic) : center(I(educ / 4.))'</span>
<span class="p">)</span>
<span class="n">model4</span> <span class="o">=</span> <span class="n">logit</span><span class="p">(</span><span class="n">model_form</span><span class="p">,</span> <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">()</span>
<span class="k">print</span> <span class="n">model4</span><span class="o">.</span><span class="n">summary</span><span class="p">()</span>
<span class="n">Optimization</span> <span class="n">terminated</span> <span class="n">successfully</span><span class="o">.</span>
<span class="n">Current</span> <span class="n">function</span> <span class="n">value</span><span class="p">:</span> <span class="mf">1945.871775</span>
<span class="n">Iterations</span> <span class="mi">5</span>
<span class="n">Logit</span> <span class="n">Regression</span> <span class="n">Results</span>
<span class="o">==============================================================================</span>
<span class="n">Dep</span><span class="o">.</span> <span class="n">Variable</span><span class="p">:</span> <span class="n">switch</span> <span class="n">No</span><span class="o">.</span> <span class="n">Observations</span><span class="p">:</span> <span class="mi">3020</span>
<span class="n">Model</span><span class="p">:</span> <span class="n">Logit</span> <span class="n">Df</span> <span class="n">Residuals</span><span class="p">:</span> <span class="mi">3013</span>
<span class="n">Method</span><span class="p">:</span> <span class="n">MLE</span> <span class="n">Df</span> <span class="n">Model</span><span class="p">:</span> <span class="mi">6</span>
<span class="n">Date</span><span class="p">:</span> <span class="n">Sat</span><span class="p">,</span> <span class="mi">22</span> <span class="n">Dec</span> <span class="mi">2012</span> <span class="n">Pseudo</span> <span class="n">R</span><span class="o">-</span><span class="n">squ</span><span class="o">.</span><span class="p">:</span> <span class="mf">0.05497</span>
<span class="n">Time</span><span class="p">:</span> <span class="mi">13</span><span class="p">:</span><span class="mo">05</span><span class="p">:</span><span class="mi">35</span> <span class="n">Log</span><span class="o">-</span><span class="n">Likelihood</span><span class="p">:</span> <span class="o">-</span><span class="mf">1945.9</span>
<span class="n">converged</span><span class="p">:</span> <span class="bp">True</span> <span class="n">LL</span><span class="o">-</span><span class="n">Null</span><span class="p">:</span> <span class="o">-</span><span class="mf">2059.0</span>
<span class="n">LLR</span> <span class="n">p</span><span class="o">-</span><span class="n">value</span><span class="p">:</span> <span class="mf">4.588e-46</span>
<span class="o">===============================================================================================================</span>
<span class="n">coef</span> <span class="n">std</span> <span class="n">err</span> <span class="n">z</span> <span class="n">P</span><span class="o">>|</span><span class="n">z</span><span class="o">|</span> <span class="p">[</span><span class="mf">95.0</span><span class="o">%</span> <span class="n">Conf</span><span class="o">.</span> <span class="n">Int</span><span class="o">.</span><span class="p">]</span>
<span class="o">---------------------------------------------------------------------------------------------------------------</span>
<span class="n">Intercept</span> <span class="mf">0.3563</span> <span class="mf">0.040</span> <span class="mf">8.844</span> <span class="mf">0.000</span> <span class="mf">0.277</span> <span class="mf">0.435</span>
<span class="n">center</span><span class="p">(</span><span class="n">I</span><span class="p">(</span><span class="n">dist</span> <span class="o">/</span> <span class="mf">100.</span><span class="p">))</span> <span class="o">-</span><span class="mf">0.9029</span> <span class="mf">0.107</span> <span class="o">-</span><span class="mf">8.414</span> <span class="mf">0.000</span> <span class="o">-</span><span class="mf">1.113</span> <span class="o">-</span><span class="mf">0.693</span>
<span class="n">center</span><span class="p">(</span><span class="n">arsenic</span><span class="p">)</span> <span class="mf">0.4950</span> <span class="mf">0.043</span> <span class="mf">11.497</span> <span class="mf">0.000</span> <span class="mf">0.411</span> <span class="mf">0.579</span>
<span class="n">center</span><span class="p">(</span><span class="n">I</span><span class="p">(</span><span class="n">educ</span> <span class="o">/</span> <span class="mf">4.</span><span class="p">))</span> <span class="mf">0.1850</span> <span class="mf">0.039</span> <span class="mf">4.720</span> <span class="mf">0.000</span> <span class="mf">0.108</span> <span class="mf">0.262</span>
<span class="n">center</span><span class="p">(</span><span class="n">I</span><span class="p">(</span><span class="n">dist</span> <span class="o">/</span> <span class="mf">100.</span><span class="p">)):</span><span class="n">center</span><span class="p">(</span><span class="n">arsenic</span><span class="p">)</span> <span class="o">-</span><span class="mf">0.1177</span> <span class="mf">0.104</span> <span class="o">-</span><span class="mf">1.137</span> <span class="mf">0.256</span> <span class="o">-</span><span class="mf">0.321</span> <span class="mf">0.085</span>
<span class="n">center</span><span class="p">(</span><span class="n">I</span><span class="p">(</span><span class="n">dist</span> <span class="o">/</span> <span class="mf">100.</span><span class="p">)):</span><span class="n">center</span><span class="p">(</span><span class="n">I</span><span class="p">(</span><span class="n">educ</span> <span class="o">/</span> <span class="mf">4.</span><span class="p">))</span> <span class="mf">0.3227</span> <span class="mf">0.107</span> <span class="mf">3.026</span> <span class="mf">0.002</span> <span class="mf">0.114</span> <span class="mf">0.532</span>
<span class="n">center</span><span class="p">(</span><span class="n">arsenic</span><span class="p">):</span><span class="n">center</span><span class="p">(</span><span class="n">I</span><span class="p">(</span><span class="n">educ</span> <span class="o">/</span> <span class="mf">4.</span><span class="p">))</span> <span class="mf">0.0722</span> <span class="mf">0.044</span> <span class="mf">1.647</span> <span class="mf">0.100</span> <span class="o">-</span><span class="mf">0.014</span> <span class="mf">0.158</span>
<span class="o">===============================================================================================================</span>
</pre></div>
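<p>Centering is what makes the intercept and main effects easy to read here. As a small sketch (helper names are mine): <code>center()</code> subtracts each variable’s mean, so every centered term, including the interactions, is zero for a respondent at the sample means, and the intercept alone gives that respondent’s log-odds of switching.</p>

```python
from math import exp
from statistics import mean

def center(values):
    """Subtract the mean, as patsy's center() does within a formula."""
    m = mean(values)
    return [v - m for v in values]

def inv_logit(z):
    """Map a value on the log-odds scale to a probability."""
    return 1.0 / (1.0 + exp(-z))

# The arsenic values from the five rows shown earlier, just to show centering.
arsenic = [2.36, 0.71, 2.07, 1.15, 1.10]
centered = center(arsenic)
print([round(c, 3) for c in centered])

# Intercept copied from the Model 4 summary: at the means of all regressors
# the centered terms vanish, so this is the predicted switching probability.
p_at_means = inv_logit(0.3563)
print(round(p_at_means, 3))
```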
<h3>Model assessment: binned residual plots</h3>
<p>Plotting residuals against regressors can alert us to issues like
nonlinearity or heteroskedasticity. Plotting raw residuals in a binary
model isn’t usually informative, so we do some smoothing. Here, we’ll
average the residuals within bins of the regressor. (A lowess or
moving average might also work.)</p>
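<p>The idea can be sketched in plain Python before reaching for pandas. This is an illustration on simulated data, not the notebook’s implementation: sort the observations by the regressor, cut them into equal-sized bins, then record each bin’s mean regressor value, its mean residual, and a band of plus or minus two standard errors around zero.</p>

```python
import random
from math import sqrt
from statistics import mean, stdev

def binned_residuals(resid, var, n_bins):
    """Average residuals within equal-count bins of var.

    Returns one (bin mean of var, bin mean residual, lower, upper) tuple
    per bin, where lower/upper are -2 and +2 standard errors around zero.
    """
    pairs = sorted(zip(var, resid))          # order observations by var
    size = len(pairs) / n_bins
    rows = []
    for b in range(n_bins):
        chunk = pairs[round(b * size):round((b + 1) * size)]
        xs = [x for x, _ in chunk]
        rs = [r for _, r in chunk]
        half_width = 2 * stdev(rs) / sqrt(len(rs))
        rows.append((mean(xs), mean(rs), -half_width, half_width))
    return rows

# Simulated stand-ins for a regressor and a model's residuals.
rng = random.Random(0)
var = [rng.uniform(0, 5) for _ in range(200)]
resid = [rng.gauss(0, 0.5) for _ in range(200)]

rows = binned_residuals(resid, var, 10)
for x_mean, r_mean, lower, upper in rows:
    print(round(x_mean, 2), round(r_mean, 3), round(lower, 3), round(upper, 3))
```

<p>If the model fits well, most bin averages should fall inside the band.</p>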
<p>I’m going to write a function to provide the binned residual data
dynamically (and another helper function to plot the data). To create
the bins I’m going to use the handy <code>qcut</code> function in pandas, which
bins a vector of data into quantiles. Then I’ll use <code>groupby</code> to
calculate the bin means and confidence intervals.</p>
<div class="highlight"><pre><span class="k">def</span> <span class="nf">bin_residuals</span><span class="p">(</span><span class="n">resid</span><span class="p">,</span> <span class="n">var</span><span class="p">,</span> <span class="n">bins</span><span class="p">):</span>
    <span class="sd">'''</span>
<span class="sd">    Compute average residuals within bins of a variable.</span>
<span class="sd">    Returns a dataframe indexed by the bins, with the bin midpoint,</span>
<span class="sd">    the residual average within the bin, and the confidence interval</span>
<span class="sd">    bounds.</span>
<span class="sd">    '''</span>
    <span class="n">resid_df</span> <span class="o">=</span> <span class="n">DataFrame</span><span class="p">({</span><span class="s">'var'</span><span class="p">:</span> <span class="n">var</span><span class="p">,</span> <span class="s">'resid'</span><span class="p">:</span> <span class="n">resid</span><span class="p">})</span>
    <span class="n">resid_df</span><span class="p">[</span><span class="s">'bins'</span><span class="p">]</span> <span class="o">=</span> <span class="n">qcut</span><span class="p">(</span><span class="n">var</span><span class="p">,</span> <span class="n">bins</span><span class="p">)</span>
    <span class="n">bin_group</span> <span class="o">=</span> <span class="n">resid_df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'bins'</span><span class="p">)</span>
    <span class="n">bin_df</span> <span class="o">=</span> <span class="n">bin_group</span><span class="p">[</span><span class="s">'var'</span><span class="p">,</span> <span class="s">'resid'</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
    <span class="n">bin_df</span><span class="p">[</span><span class="s">'count'</span><span class="p">]</span> <span class="o">=</span> <span class="n">bin_group</span><span class="p">[</span><span class="s">'resid'</span><span class="p">]</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
    <span class="n">bin_df</span><span class="p">[</span><span class="s">'lower_ci'</span><span class="p">]</span> <span class="o">=</span> <span class="o">-</span><span class="mi">2</span> <span class="o">*</span> <span class="p">(</span><span class="n">bin_group</span><span class="p">[</span><span class="s">'resid'</span><span class="p">]</span><span class="o">.</span><span class="n">std</span><span class="p">()</span> <span class="o">/</span>
                               <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">bin_group</span><span class="p">[</span><span class="s">'resid'</span><span class="p">]</span><span class="o">.</span><span class="n">count</span><span class="p">()))</span>
    <span class="n">bin_df</span><span class="p">[</span><span class="s">'upper_ci'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">*</span> <span class="p">(</span><span class="n">bin_group</span><span class="p">[</span><span class="s">'resid'</span><span class="p">]</span><span class="o">.</span><span class="n">std</span><span class="p">()</span> <span class="o">/</span>
                              <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">bin_df</span><span class="p">[</span><span class="s">'count'</span><span class="p">]))</span>
    <span class="n">bin_df</span> <span class="o">=</span> <span class="n">bin_df</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="s">'var'</span><span class="p">)</span>
    <span class="k">return</span><span class="p">(</span><span class="n">bin_df</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">plot_binned_residuals</span><span class="p">(</span><span class="n">bin_df</span><span class="p">):</span>
    <span class="sd">'''</span>
<span class="sd">    Plot binned residual averages and confidence intervals.</span>
<span class="sd">    '''</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">bin_df</span><span class="p">[</span><span class="s">'var'</span><span class="p">],</span> <span class="n">bin_df</span><span class="p">[</span><span class="s">'resid'</span><span class="p">],</span> <span class="s">'.'</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">bin_df</span><span class="p">[</span><span class="s">'var'</span><span class="p">],</span> <span class="n">bin_df</span><span class="p">[</span><span class="s">'lower_ci'</span><span class="p">],</span> <span class="s">'-r'</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">bin_df</span><span class="p">[</span><span class="s">'var'</span><span class="p">],</span> <span class="n">bin_df</span><span class="p">[</span><span class="s">'upper_ci'</span><span class="p">],</span> <span class="s">'-r'</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">axhline</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">color</span> <span class="o">=</span> <span class="s">'gray'</span><span class="p">,</span> <span class="n">lw</span> <span class="o">=</span> <span class="o">.</span><span class="mi">5</span><span class="p">)</span>
<span class="n">arsenic_resids</span> <span class="o">=</span> <span class="n">bin_residuals</span><span class="p">(</span><span class="n">model4</span><span class="o">.</span><span class="n">resid</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s">'arsenic'</span><span class="p">],</span> <span class="mi">40</span><span class="p">)</span>
<span class="n">dist_resids</span> <span class="o">=</span> <span class="n">bin_residuals</span><span class="p">(</span><span class="n">model4</span><span class="o">.</span><span class="n">resid</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s">'dist'</span><span class="p">],</span> <span class="mi">40</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span> <span class="o">=</span> <span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">121</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Residual (bin avg.)'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'Arsenic (bin avg.)'</span><span class="p">)</span>
<span class="n">plot_binned_residuals</span><span class="p">(</span><span class="n">arsenic_resids</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">122</span><span class="p">)</span>
<span class="n">plot_binned_residuals</span><span class="p">(</span><span class="n">dist_resids</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Residual (bin avg.)'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'Distance (bin avg.)'</span><span class="p">)</span>
</pre></div>
<p><a href="../images/arsenic_dist_bin_resid.png">
<img src="../images/arsenic_dist_bin_resid.png" width=400px />
</a></p>
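<p>For anyone who wants to try the binning step on its own, here is a minimal sketch. It is not the notebook's code: the name <code>binned_residuals</code>, the choice of <code>pd.qcut</code> to get equal-count bins, and the synthetic data are all assumptions made for illustration. The idea is the same as above: average the residuals within each bin and draw bands at plus or minus two standard errors of the bin mean.</p>

```python
import numpy as np
import pandas as pd

def binned_residuals(resid, var, bins):
    """Bin `var` into equal-count bins; return per-bin means and +/- 2 SE bands.

    Hypothetical helper for illustration; not the function from the notebook.
    """
    frame = pd.DataFrame({'var': np.asarray(var, dtype=float),
                          'resid': np.asarray(resid, dtype=float)})
    # qcut cuts at quantiles, so each bin holds roughly the same number of rows.
    frame['bin'] = pd.qcut(frame['var'], bins, labels=False, duplicates='drop')
    grouped = frame.groupby('bin')
    out = grouped[['var', 'resid']].mean()
    out['count'] = grouped['resid'].count()
    # Two standard errors of the bin mean: 2 * sd / sqrt(n).
    half_width = 2 * grouped['resid'].std() / np.sqrt(out['count'])
    out['lower_ci'] = -half_width
    out['upper_ci'] = half_width
    return out.sort_values('var')

# Synthetic data (an assumption): residuals that are unrelated to the predictor.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=1000)
e = rng.normal(0, 1, size=1000)
toy_bins = binned_residuals(e, x, 10)
```

<p>With residuals that really are unrelated to the predictor, most bin averages should fall inside the bands. A systematic pattern, like the one in the arsenic panel, is the thing to look for.</p>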
<h2>Model 5: log-scaling arsenic</h2>
<p>The binned residual plot indicates some nonlinearity in the arsenic
variable. Note how the model overestimates for low arsenic and
underestimates for high arsenic. This suggests a log transformation or
something similar.</p>
<p>We can again do this transformation right in the formula.</p>
<div class="highlight"><pre><span class="n">model_form</span> <span class="o">=</span> <span class="p">(</span><span class="s">'switch ~ center(I(dist / 100.)) + '</span> <span class="o">+</span>
<span class="s">'center(np.log(arsenic)) + '</span> <span class="o">+</span>
<span class="s">'center(I(educ / 4.)) + '</span> <span class="o">+</span>
<span class="s">'center(I(dist / 100.)) : center(np.log(arsenic)) + '</span> <span class="o">+</span>
<span class="s">'center(I(dist / 100.)) : center(I(educ / 4.)) + '</span> <span class="o">+</span>
<span class="s">'center(np.log(arsenic)) : center(I(educ / 4.))'</span>
<span class="p">)</span>
<span class="n">model5</span> <span class="o">=</span> <span class="n">logit</span><span class="p">(</span><span class="n">model_form</span><span class="p">,</span> <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">()</span>
<span class="k">print</span> <span class="n">model5</span><span class="o">.</span><span class="n">summary</span><span class="p">()</span>
<span class="n">Optimization</span> <span class="n">terminated</span> <span class="n">successfully</span><span class="o">.</span>
<span class="n">Current</span> <span class="n">function</span> <span class="n">value</span><span class="p">:</span> <span class="mf">1931.554102</span>
<span class="n">Iterations</span> <span class="mi">5</span>
<span class="n">Logit</span> <span class="n">Regression</span> <span class="n">Results</span>
<span class="o">==============================================================================</span>
<span class="n">Dep</span><span class="o">.</span> <span class="n">Variable</span><span class="p">:</span> <span class="n">switch</span> <span class="n">No</span><span class="o">.</span> <span class="n">Observations</span><span class="p">:</span> <span class="mi">3020</span>
<span class="n">Model</span><span class="p">:</span> <span class="n">Logit</span> <span class="n">Df</span> <span class="n">Residuals</span><span class="p">:</span> <span class="mi">3013</span>
<span class="n">Method</span><span class="p">:</span> <span class="n">MLE</span> <span class="n">Df</span> <span class="n">Model</span><span class="p">:</span> <span class="mi">6</span>
<span class="n">Date</span><span class="p">:</span> <span class="n">Sat</span><span class="p">,</span> <span class="mi">22</span> <span class="n">Dec</span> <span class="mi">2012</span> <span class="n">Pseudo</span> <span class="n">R</span><span class="o">-</span><span class="n">squ</span><span class="o">.</span><span class="p">:</span> <span class="mf">0.06192</span>
<span class="n">Time</span><span class="p">:</span> <span class="mi">13</span><span class="p">:</span><span class="mo">05</span><span class="p">:</span><span class="mi">57</span> <span class="n">Log</span><span class="o">-</span><span class="n">Likelihood</span><span class="p">:</span> <span class="o">-</span><span class="mf">1931.6</span>
<span class="n">converged</span><span class="p">:</span> <span class="bp">True</span> <span class="n">LL</span><span class="o">-</span><span class="n">Null</span><span class="p">:</span> <span class="o">-</span><span class="mf">2059.0</span>
<span class="n">LLR</span> <span class="n">p</span><span class="o">-</span><span class="n">value</span><span class="p">:</span> <span class="mf">3.517e-52</span>
<span class="o">==================================================================================================================</span>
<span class="n">coef</span> <span class="n">std</span> <span class="n">err</span> <span class="n">z</span> <span class="n">P</span><span class="o">>|</span><span class="n">z</span><span class="o">|</span> <span class="p">[</span><span class="mf">95.0</span><span class="o">%</span> <span class="n">Conf</span><span class="o">.</span> <span class="n">Int</span><span class="o">.</span><span class="p">]</span>
<span class="o">------------------------------------------------------------------------------------------------------------------</span>
<span class="n">Intercept</span> <span class="mf">0.3452</span> <span class="mf">0.040</span> <span class="mf">8.528</span> <span class="mf">0.000</span> <span class="mf">0.266</span> <span class="mf">0.425</span>
<span class="n">center</span><span class="p">(</span><span class="n">I</span><span class="p">(</span><span class="n">dist</span> <span class="o">/</span> <span class="mf">100.</span><span class="p">))</span> <span class="o">-</span><span class="mf">0.9796</span> <span class="mf">0.111</span> <span class="o">-</span><span class="mf">8.809</span> <span class="mf">0.000</span> <span class="o">-</span><span class="mf">1.197</span> <span class="o">-</span><span class="mf">0.762</span>
<span class="n">center</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="n">arsenic</span><span class="p">))</span> <span class="mf">0.9036</span> <span class="mf">0.070</span> <span class="mf">12.999</span> <span class="mf">0.000</span> <span class="mf">0.767</span> <span class="mf">1.040</span>
<span class="n">center</span><span class="p">(</span><span class="n">I</span><span class="p">(</span><span class="n">educ</span> <span class="o">/</span> <span class="mf">4.</span><span class="p">))</span> <span class="mf">0.1785</span> <span class="mf">0.039</span> <span class="mf">4.577</span> <span class="mf">0.000</span> <span class="mf">0.102</span> <span class="mf">0.255</span>
<span class="n">center</span><span class="p">(</span><span class="n">I</span><span class="p">(</span><span class="n">dist</span> <span class="o">/</span> <span class="mf">100.</span><span class="p">)):</span><span class="n">center</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="n">arsenic</span><span class="p">))</span> <span class="o">-</span><span class="mf">0.1567</span> <span class="mf">0.185</span> <span class="o">-</span><span class="mf">0.846</span> <span class="mf">0.397</span> <span class="o">-</span><span class="mf">0.520</span> <span class="mf">0.206</span>
<span class="n">center</span><span class="p">(</span><span class="n">I</span><span class="p">(</span><span class="n">dist</span> <span class="o">/</span> <span class="mf">100.</span><span class="p">)):</span><span class="n">center</span><span class="p">(</span><span class="n">I</span><span class="p">(</span><span class="n">educ</span> <span class="o">/</span> <span class="mf">4.</span><span class="p">))</span> <span class="mf">0.3384</span> <span class="mf">0.108</span> <span class="mf">3.141</span> <span class="mf">0.002</span> <span class="mf">0.127</span> <span class="mf">0.550</span>
<span class="n">center</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="n">arsenic</span><span class="p">)):</span><span class="n">center</span><span class="p">(</span><span class="n">I</span><span class="p">(</span><span class="n">educ</span> <span class="o">/</span> <span class="mf">4.</span><span class="p">))</span> <span class="mf">0.0601</span> <span class="mf">0.070</span> <span class="mf">0.855</span> <span class="mf">0.393</span> <span class="o">-</span><span class="mf">0.078</span> <span class="mf">0.198</span>
<span class="o">==================================================================================================================</span>
</pre></div>
<p>And the binned residual plot for arsenic now looks better.</p>
<p><a href="../images/logarsenic_dist_bin_resid.png">
<img src="../images/logarsenic_dist_bin_resid.png" width=400px />
</a></p>
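<p>To see what those formula terms actually build, it can help to ask patsy for the design matrix directly. The sketch below is an illustration under assumptions: the four-row <code>toy</code> dataframe is made up, and only one interaction is included to keep it short. It relies on patsy's built-in <code>center</code> transform and on <code>a:b</code> producing the elementwise product of the two columns.</p>

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

# Made-up values, purely for illustration.
toy = pd.DataFrame({'dist': [10.0, 50.0, 120.0, 200.0],
                    'arsenic': [0.5, 1.0, 2.0, 4.0]})

formula = ('center(I(dist / 100.)) + center(np.log(arsenic)) + '
           'center(I(dist / 100.)):center(np.log(arsenic))')
design = dmatrix(formula, toy, return_type='dataframe')

# Columns: intercept, the two centered main effects, then their interaction.
m = np.asarray(design)
main_a, main_b, interaction = m[:, 1], m[:, 2], m[:, 3]
```

<p>Each centered column has mean zero, and the interaction column is just the product of the two centered columns. None of these columns ever has to be added to the dataframe by hand.</p>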
<h3>Model error rates</h3>
<p>The <code>pred_table()</code> gives us a confusion matrix for the model. We can use
this to compute the error rate of the model.</p>
<p>We should compare this to the null error rate, which comes from a model
that just classifies everything as whatever the most prevalent response
is. Here 58% of the respondents were switchers, so the null model just
classifies everyone as a switcher, and therefore has an error rate of 42%.</p>
<div class="highlight"><pre><span class="k">print</span> <span class="n">model5</span><span class="o">.</span><span class="n">pred_table</span><span class="p">()</span>
<span class="k">print</span> <span class="s">'Model Error rate: {0: 3.0%}'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span>
<span class="mi">1</span> <span class="o">-</span> <span class="n">np</span><span class="o">.</span><span class="n">diag</span><span class="p">(</span><span class="n">model5</span><span class="o">.</span><span class="n">pred_table</span><span class="p">())</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span> <span class="o">/</span> <span class="n">model5</span><span class="o">.</span><span class="n">pred_table</span><span class="p">()</span><span class="o">.</span><span class="n">sum</span><span class="p">())</span>
<span class="k">print</span> <span class="s">'Null Error Rate: {0: 3.0%}'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">df</span><span class="p">[</span><span class="s">'switch'</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">())</span>
<span class="p">[[</span> <span class="mf">568.</span> <span class="mf">715.</span><span class="p">]</span>
<span class="p">[</span> <span class="mf">387.</span> <span class="mf">1350.</span><span class="p">]]</span>
<span class="n">Model</span> <span class="n">Error</span> <span class="n">rate</span><span class="p">:</span> <span class="mi">36</span><span class="o">%</span>
<span class="n">Null</span> <span class="n">Error</span> <span class="n">Rate</span><span class="p">:</span> <span class="mi">42</span><span class="o">%</span>
</pre></div>
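<p>The arithmetic behind those two numbers can be checked by hand from the confusion matrix. This is a small sketch using the counts printed above. It assumes that rows are actual outcomes and columns are predicted outcomes, which is how I read the statsmodels documentation and which matches the 58% share of switchers in the second row.</p>

```python
# Confusion matrix copied from the pred_table() output above.
# Assumed layout: rows are actual outcomes (0, 1), columns are predictions (0, 1).
pred_table = [[568.0, 715.0],
              [387.0, 1350.0]]

total = sum(sum(row) for row in pred_table)

# The model is right on the diagonal: predicted 0 when actual 0, 1 when actual 1.
correct = pred_table[0][0] + pred_table[1][1]
model_error = 1 - correct / total

# The null model predicts the majority class for everyone, so it is wrong
# for every respondent in the minority class.
share_switchers = sum(pred_table[1]) / total
null_error = 1 - max(share_switchers, 1 - share_switchers)

print('Model error rate: {:.1%}'.format(model_error))
print('Null error rate:  {:.1%}'.format(null_error))
```

<p>The off-diagonal cells (715 and 387) are the model's mistakes: 1,102 out of 3,020 respondents. The null model's mistakes are the 1,283 actual non-switchers in the first row.</p>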
<h2>Conclusion</h2>
<p>So this was a more in-depth example of running a logistic regression
with statsmodels and the formula <span class="caps">API</span>. Unlike last time, when we were
just specifying the variables in the model, here we used the formula
language to apply transforms and create interactions. I really love
this: it drastically reduces the number of steps between thinking up a
model and fitting it.</p>Machine Learning for Hackers Chapter 2, Part 2: Logistic regression with statsmodels2012-12-21T04:04:00-05:00Carltag:slendermeans.org,2012-12-21:ml4h-ch2-p2.html<h2>Introduction</h2>
<p>I last left chapter 2 of <em>Machine Learning for Hackers</em> (a long time
ago), running some kernel density estimators on height and weight data
(see <a href="../ml4h-ch2-p1.html">here</a>). The next part of the chapter plots a scatterplot of
weight vs. height and runs a lowess smoother through it. I’m not going
to write any more about the lowess function in statsmodels. I’ve
discussed some issues with it (i.e. it’s slow) <a href="../lowess-speed.html">here</a>. And it’s my
sense that the lowess <span class="caps">API</span>, as it is now in statsmodels, is not long for
this world. The code is all in the IPython notebooks in <a href="https://github.com/carljv/Will_it_Python/tree/master/MLFH/CH2">the Github
repo</a> and is pretty straightforward.</p>
<h2>Patsy and statsmodels formulas</h2>
<p>What I want to skip to here is the logistic regressions the authors run
to close out the chapter. Back in the spring, I coded up the chapter in
<a href="http://nbviewer.ipython.org/urls/raw.github.com/carljv/Will_it_Python/master/MLFH/CH2/ch2.ipynb">this notebook</a>. At this point, there wasn’t really much cohesion
between pandas and statsmodels. You’d end up doing data exploration and
munging with pandas, then pulling what you needed out of dataframes into
numpy arrays, and passing those arrays to statsmodels. (After writing
seemingly needless boilerplate code like
<code>X = sm.add_constant(X, prepend = True)</code>. Who’s out there running all
these regressions without constant terms, such that it makes sense to
force the user to explicitly add a constant vector to the data matrix?)</p>
<p>Over the summer, though, something quite cool happened. <a href="https://patsy.readthedocs.org/en/latest/#">patsy</a>
brought a formula interface to Python, and it got integrated into a
number of components of statsmodels. Skipper Seabold’s <a href="http://jseabold.net/presentations/seabold_pydata2012.html#slide1">Pydata
presentation</a> is a good overview and demo. In a nutshell, statsmodels
now talks to your pandas dataframes via an expressive “formula”
description of your model.</p>
<p>For example, imagine we had a dataframe, <code>df</code>, with variables <code>x1</code>,
<code>x2</code>, and <code>y</code>. If we wanted to regress <code>y</code> on <code>x1</code> and <code>x2</code> with the
standard statsmodels <span class="caps">API</span>, we’d code something like the following:</p>
<div class="highlight"><pre><span class="n">Xmat</span> <span class="o">=</span> <span class="n">sm</span><span class="o">.</span><span class="n">add_constant</span><span class="p">(</span><span class="n">df</span><span class="p">[[</span><span class="s">'x1'</span><span class="p">,</span> <span class="s">'x2'</span><span class="p">]]</span><span class="o">.</span><span class="n">values</span><span class="p">,</span> <span class="n">prepend</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
<span class="n">yvec</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'y'</span><span class="p">]</span><span class="o">.</span><span class="n">values</span>
<span class="n">ols_model</span> <span class="o">=</span> <span class="n">OLS</span><span class="p">(</span><span class="n">yvec</span><span class="p">,</span> <span class="n">Xmat</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">()</span>
</pre></div>
<p>Which is tolerable with short variable names. Once you start using
longer names or need more <span class="caps">RHS</span> variables it becomes a mess. With patsy
and the formula <span class="caps">API</span>, you just have:</p>
<div class="highlight"><pre><span class="n">ols_model</span> <span class="o">=</span> <span class="n">ols</span><span class="p">(</span><span class="s">'y ~ x1 + x2'</span><span class="p">,</span> <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">()</span>
</pre></div>
<p>Which is just as simple as using <code>lm</code> in R. You can also specify
variable transformations and interactions in the formula, without
needing to pre-compute variables for them. It’s pretty slick.</p>
<p>All of this is still brand new, and largely undocumented, so proceed
with caution. But I’ve gotten very excited incorporating it into my
code. Stuff I wrote just 5 or 6 months ago looks clunky and outdated.</p>
<p>So I’ve updated the IPython notebook for chapter 2, <a href="http://nbviewer.ipython.org/urls/raw.github.com/carljv/Will_it_Python/master/MLFH/CH2/ch2_with_formulas.ipynb">here</a>, to
incorporate the formula <span class="caps">API</span>. That’s what I’ll discuss in the rest of the post.</p>
<h2>Logistic regression with formulas in statsmodels</h2>
<p>The authors run a logistic regression to see if they can use a person’s
height and weight to determine their gender. I’m not really sure why
you’d run such a model (or how meaningful it is once you run it, given
how co-linear height and weight are), but it’s easy enough for
illustrating how to mechanically run a logistic regression and use it to
linearly separate groups.</p>
<p>The dataset contains variables <code>Height</code>, <code>Weight</code>, and <code>Gender</code>. The
latter is a string encoded either <code>Male</code> or <code>Female</code>. To run a logistic
regression, we’ll want to transform this to a numerical 0/1 variable. We
can do this a number of ways, but I’ll use the <code>map</code> method.</p>
<div class="highlight"><pre><span class="n">heights_weights</span><span class="p">[</span><span class="s">'Male'</span><span class="p">]</span> <span class="o">=</span> <span class="n">heights_weights</span><span class="p">[</span><span class="s">'Gender'</span><span class="p">]</span><span class="o">.</span><span class="n">map</span><span class="p">({</span><span class="s">'Male'</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> <span class="s">'Female'</span><span class="p">:</span> <span class="mi">0</span><span class="p">})</span>
</pre></div>
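<p>One thing worth knowing about <code>map</code> with a dictionary: any value that isn’t a key in the dictionary comes back as <code>NaN</code>, so a stray spelling in the column fails quietly. A quick sketch on a made-up four-row frame (the data and the name <code>n_unmapped</code> are assumptions for illustration):</p>

```python
import pandas as pd

# Made-up data; the last row has a lowercase label that the dict doesn't cover.
toy = pd.DataFrame({'Gender': ['Male', 'Female', 'Female', 'male']})
toy['Male'] = toy['Gender'].map({'Male': 1, 'Female': 0})

# Count rows whose label had no entry in the mapping.
n_unmapped = int(toy['Male'].isna().sum())
print(toy)
print('Unmapped rows:', n_unmapped)
```

<p>Checking that count is zero before fitting is a cheap way to catch encoding surprises.</p>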
<p>The <code>statsmodels.formula.api</code> module has a number of functions,
including <code>ols</code>, <code>logit</code>, and <code>glm</code>. If we import <code>logit</code> from the
module we can run a logistic regression easily.</p>
<div class="highlight"><pre><span class="n">male_logit</span> <span class="o">=</span> <span class="n">logit</span><span class="p">(</span><span class="n">formula</span> <span class="o">=</span> <span class="s">'Male ~ Height + Weight'</span><span class="p">,</span> <span class="n">df</span> <span class="o">=</span> <span class="n">heights_weights</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">()</span>
<span class="k">print</span> <span class="n">male_logit</span><span class="o">.</span><span class="n">summary</span><span class="p">()</span>
</pre></div>
<p>With these results:</p>
<div class="highlight"><pre><span class="nx">Optimization</span> <span class="nx">terminated</span> <span class="nx">successfully</span><span class="p">.</span>
<span class="nx">Current</span> <span class="kd">function</span> <span class="nx">value</span><span class="o">:</span> <span class="mf">2091.297971</span>
<span class="nx">Iterations</span> <span class="mi">8</span>
<span class="nx">Logit</span> <span class="nx">Regression</span> <span class="nx">Results</span>
<span class="o">==============================================================================</span>
<span class="nx">Dep</span><span class="p">.</span> <span class="nx">Variable</span><span class="o">:</span> <span class="nx">Male</span> <span class="nx">No</span><span class="p">.</span> <span class="nx">Observations</span><span class="o">:</span> <span class="mi">10000</span>
<span class="nx">Model</span><span class="o">:</span> <span class="nx">Logit</span> <span class="nx">Df</span> <span class="nx">Residuals</span><span class="o">:</span> <span class="mi">9997</span>
<span class="nx">Method</span><span class="o">:</span> <span class="nx">MLE</span> <span class="nx">Df</span> <span class="nx">Model</span><span class="o">:</span> <span class="mi">2</span>
<span class="nb">Date</span><span class="o">:</span> <span class="nx">Thu</span><span class="p">,</span> <span class="mi">20</span> <span class="nx">Dec</span> <span class="mi">2012</span> <span class="nx">Pseudo</span> <span class="nx">R</span><span class="o">-</span><span class="nx">squ</span><span class="p">.</span><span class="o">:</span> <span class="mf">0.6983</span>
<span class="nx">Time</span><span class="o">:</span> <span class="mi">14</span><span class="o">:</span><span class="mi">41</span><span class="o">:</span><span class="mi">33</span> <span class="nx">Log</span><span class="o">-</span><span class="nx">Likelihood</span><span class="o">:</span> <span class="o">-</span><span class="mf">2091.3</span>
<span class="nx">converged</span><span class="o">:</span> <span class="nx">True</span> <span class="nx">LL</span><span class="o">-</span><span class="nx">Null</span><span class="o">:</span> <span class="o">-</span><span class="mf">6931.5</span>
<span class="nx">LLR</span> <span class="nx">p</span><span class="o">-</span><span class="nx">value</span><span class="o">:</span> <span class="mf">0.000</span>
<span class="o">==============================================================================</span>
<span class="nx">coef</span> <span class="nx">std</span> <span class="nx">err</span> <span class="nx">z</span> <span class="nx">P</span><span class="o">>|</span><span class="nx">z</span><span class="o">|</span> <span class="cp">[</span><span class="mf">95.0</span><span class="o">%</span> <span class="nx">Conf.</span> <span class="nx">Int.</span><span class="cp">]</span>
<span class="o">------------------------------------------------------------------------------</span>
<span class="nx">Intercept</span> <span class="mf">0.6925</span> <span class="mf">1.328</span> <span class="mf">0.521</span> <span class="mf">0.602</span> <span class="o">-</span><span class="mf">1.911</span> <span class="mf">3.296</span>
<span class="nx">Height</span> <span class="o">-</span><span class="mf">0.4926</span> <span class="mf">0.029</span> <span class="o">-</span><span class="mf">17.013</span> <span class="mf">0.000</span> <span class="o">-</span><span class="mf">0.549</span> <span class="o">-</span><span class="mf">0.436</span>
<span class="nx">Weight</span> <span class="mf">0.1983</span> <span class="mf">0.005</span> <span class="mf">38.663</span> <span class="mf">0.000</span> <span class="mf">0.188</span> <span class="mf">0.208</span>
<span class="o">==============================================================================</span>
</pre></div>
<p>Just for fun, we can also run the logistic regression via a <span class="caps">GLM</span> with a
binomial family and logit link. This is similar to how I’d run it in R.</p>
<div class="highlight"><pre><span class="n">male_glm_logit</span> <span class="o">=</span> <span class="n">glm</span><span class="p">(</span><span class="s">'Male ~ Height + Weight'</span><span class="p">,</span> <span class="n">df</span> <span class="o">=</span> <span class="n">heights_weights</span><span class="p">,</span>
<span class="n">family</span> <span class="o">=</span> <span class="n">sm</span><span class="o">.</span><span class="n">families</span><span class="o">.</span><span class="n">Binomial</span><span class="p">(</span><span class="n">sm</span><span class="o">.</span><span class="n">families</span><span class="o">.</span><span class="n">links</span><span class="o">.</span><span class="n">logit</span><span class="p">))</span><span class="o">.</span><span class="n">fit</span><span class="p">()</span>
<span class="k">print</span> <span class="n">male_glm_logit</span><span class="o">.</span><span class="n">summary</span><span class="p">()</span>
</pre></div>
<p>And the results are the same:</p>
<div class="highlight"><pre>Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: Male No. Observations: 10000
Model: GLM Df Residuals: 9997
Model Family: Binomial Df Model: 2
Link Function: logit Scale: 1.0
Method: IRLS Log-Likelihood: -2091.3
Date: Thu, 20 Dec 2012 Deviance: 4182.6
Time: 14:41:37 Pearson chi2: 9.72e+03
No. Iterations: 8
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 0.6925 1.328 0.521 0.602 -1.911 3.296
Height -0.4926 0.029 -17.013 0.000 -0.549 -0.436
Weight 0.1983 0.005 38.663 0.000 0.188 0.208
==============================================================================
</pre></div>
<p>Now we can use the coefficients to plot a separating line in height-weight space.</p>
<div class="highlight"><pre><span class="n">logit_pars</span> <span class="o">=</span> <span class="n">male_logit</span><span class="o">.</span><span class="n">params</span>
<span class="n">intercept</span> <span class="o">=</span> <span class="o">-</span><span class="n">logit_pars</span><span class="p">[</span><span class="s">'Intercept'</span><span class="p">]</span> <span class="o">/</span> <span class="n">logit_pars</span><span class="p">[</span><span class="s">'Weight'</span><span class="p">]</span>
<span class="n">slope</span> <span class="o">=</span> <span class="o">-</span><span class="n">logit_pars</span><span class="p">[</span><span class="s">'Height'</span><span class="p">]</span> <span class="o">/</span> <span class="n">logit_pars</span><span class="p">[</span><span class="s">'Weight'</span><span class="p">]</span>
</pre></div>
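<p>Where do those two expressions come from? The model predicts male and female as equally likely when the linear predictor is zero: <code>b0 + b_h * Height + b_w * Weight = 0</code>. Solving for weight gives <code>Weight = -b0 / b_w - (b_h / b_w) * Height</code>, which is a line in height-weight space. The sketch below plugs in the coefficients from the summary table and checks that a point on the line really does have a linear predictor of zero. The variable and function names are mine, chosen for illustration.</p>

```python
# Coefficients copied from the logit summary table above.
b0, b_height, b_weight = 0.6925, -0.4926, 0.1983

# Solve b0 + b_height * H + b_weight * W = 0 for W.
intercept = -b0 / b_weight
slope = -b_height / b_weight

def linear_predictor(height, weight):
    """Log-odds of Male for a given height (in.) and weight (lbs.)."""
    return b0 + b_height * height + b_weight * weight

# Pick any height and find the weight that sits exactly on the boundary.
height = 66.0
weight_on_line = intercept + slope * height

print('slope: {:.3f} lbs per inch'.format(slope))
print('boundary weight at 66 in.: {:.1f} lbs'.format(weight_on_line))
print('log-odds on the line: {:.2e}'.format(linear_predictor(height, weight_on_line)))
```

<p>Because the weight coefficient is positive, points above the line (heavier for a given height) are classified as male and points below it as female.</p>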
<p>Let’s plot the data, color-coded by sex, and the separating line.</p>
<div class="highlight"><pre><span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span> <span class="o">=</span> <span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">8</span><span class="p">))</span>
<span class="c"># Women points (coral)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">heights_f</span><span class="p">,</span> <span class="n">weights_f</span><span class="p">,</span> <span class="s">'.'</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'Female'</span><span class="p">,</span>
<span class="n">mfc</span> <span class="o">=</span> <span class="s">'None'</span><span class="p">,</span> <span class="n">mec</span><span class="o">=</span><span class="s">'coral'</span><span class="p">,</span> <span class="n">alpha</span> <span class="o">=</span> <span class="o">.</span><span class="mi">4</span><span class="p">)</span>
<span class="c"># Men points (blue)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">heights_m</span><span class="p">,</span> <span class="n">weights_m</span><span class="p">,</span> <span class="s">'.'</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'Male'</span><span class="p">,</span>
<span class="n">mfc</span> <span class="o">=</span> <span class="s">'None'</span><span class="p">,</span> <span class="n">mec</span><span class="o">=</span><span class="s">'steelblue'</span><span class="p">,</span> <span class="n">alpha</span> <span class="o">=</span> <span class="o">.</span><span class="mi">4</span><span class="p">)</span>
<span class="c"># The separating line</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">array</span><span class="p">([</span><span class="mi">50</span><span class="p">,</span> <span class="mi">80</span><span class="p">]),</span> <span class="n">intercept</span> <span class="o">+</span> <span class="n">slope</span> <span class="o">*</span> <span class="n">array</span><span class="p">([</span><span class="mi">50</span><span class="p">,</span> <span class="mi">80</span><span class="p">]),</span>
<span class="s">'-'</span><span class="p">,</span> <span class="n">color</span> <span class="o">=</span> <span class="s">'#461B7E'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'Height (in.)'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Weight (lbs.)'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="s">'upper left'</span><span class="p">)</span>
</pre></div>
<p><a href="../images/logit_hw_sex_separate.png">
<img src="../images/logit_hw_sex_separate.png" width=450px />
</a></p>
<h2>Conclusion</h2>
<p>There are several more examples using Patsy formulas with statsmodels
functions in later chapters. If you’re accustomed to R’s formula
notation, the transition from running models in R to running models in
statsmodels is easy. One of the annoying things in Python versus R is
the need to pull arrays out of pandas dataframes, because the functions
you want to apply to the data (say estimating models, or plotting) don’t
interface with the dataframe, but instead with numpy arrays. It’s not
terrible, but it adds a layer of friction in the analysis. So it’s great
that statsmodels is starting to integrate well with pandas.</p>Machine Learning for Hackers Chapter 3: Naive Bayes Text Classification2012-12-20T04:20:00-05:00Carltag:slendermeans.org,2012-12-20:ml4h-ch3.html<p>I realize I haven’t blogged about the rest of chapter 2 yet. I’ll get
back to that, but chapter 3 is on my mind today. If you haven’t seen
them yet, IPython notebooks up to chapter 9 are all up in the <a href="https://github.com/carljv/Will_it_Python/tree/master/MLFH">Github
repo</a>. To view them online, you can check the links on <a href="../category/will-it-python.html">this page</a>.</p>
<p>Chapter 3 is about text classification. The authors build a classifier
that will identify whether an e-mail is spam or not (“ham”) based on the
content of the e-mail’s message. I won’t go into much detail on how the
Naive Bayes classifier they use works (beyond what’s evident in the
code). The theory is described well in the book and many other places.
I’m just going to discuss implementation, assuming you know how the
classifier works in theory. The Python code for this project relies
heavily on the <span class="caps">NLTK</span> (Natural Language Toolkit) package, which is a
comprehensive library that includes functions for doing <span class="caps">NLP</span> and text
analysis, as well as an array of benchmark text corpora to use them on.
If you want to go deep into this stuff, two good resources are:</p>
<ul>
<li><a href="http://shop.oreilly.com/product/9780596516499.do"><em>Natural Language Processing with Python</em></a> by S. Bird, E. Klein,
and E. Loper; and</li>
<li><a href="http://www.packtpub.com/python-text-processing-nltk-20-cookbook/book"><em>Python Text Processing with <span class="caps">NLTK</span> 2.0 Cookbook</em></a> by J. Perkins</li>
</ul>
<h2>Two versions of the program</h2>
<p>I’ve coded up two different versions of this chapter. The first,
<a href="http://nbviewer.ipython.org/urls/raw.github.com/carljv/Will_it_Python/master/MLFH/CH3/ch3.ipynb">here</a>, tries to follow the book relatively closely. The general
procedure they use is:</p>
<ol>
<li>Parse and tokenize the e-mails</li>
<li>Create a term-document matrix of the e-mails</li>
<li>Calculate features of the training e-mails using the term-document matrix</li>
<li>Train the classifier on these features</li>
<li>Test the classifier on other sets of spam and ham e-mails</li>
</ol>
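<p>To make steps 1 and 2 concrete, here is a minimal pure-Python sketch of a term-document matrix. It is not the notebook's code: the <code>tokenize</code> and <code>term_document_matrix</code> helpers and the two sample e-mails are made up for illustration, and the tokenizer is a deliberately crude stand-in for the real parsing step.</p>

```python
from collections import Counter
import re


def tokenize(text):
    # Crude stand-in for the real parser: lowercase, keep alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())


def term_document_matrix(docs):
    """Return (terms, matrix), where matrix[i][j] counts terms[i] in docs[j]."""
    counts = [Counter(tokenize(d)) for d in docs]
    # The vocabulary is every term seen in any document, in sorted order.
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix


# Hypothetical two-document corpus.
emails = ["Buy cheap pills, buy now", "Meeting notes for now"]
terms, tdm = term_document_matrix(emails)
for term, row in zip(terms, tdm):
    print(term, row)
```

<p>Each row is a term and each column a document, so most entries are zero once the vocabulary grows, which is why a sparse representation is the usual choice.</p>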
<p>I’m not going to discuss this version in much detail, but you should
take a look at the notebook if you’re interested. Two big takeaways from
this are:</p>
<ul>
<li>
<p><strong>Python lacks a good term-document matrix tool.</strong> I was surprised to find that <span class="caps">NLTK</span>, which has so much functionality including helper functions like <code>FreqDist</code>, doesn’t have a function for making term-document matrices similar to the <code>tdm</code> function in R’s <code>tm</code> package. There is a Python module called <code>textmining</code> (which you can install with pip) that does have a term-document matrix function, but it’s pretty rudimentary. What you’ll see in this chapter is that I’ve coded up a term-document matrix function that uses the one in <code>textmining</code> but adds some bells and whistles, and returns the <span class="caps">TDM</span> as a
(typically sparse) pandas dataframe.</p>
</li>
<li>
<p><strong>The authors’ classifier suffers from numerical errors.</strong> The Naive
Bayes classifier calculates the probability that a message is spam by
calculating the probability that the message’s terms occur in a spam
message. So if the message is just “buy viagra”, and “buy” occurs in 75%
of the training spam, and “viagra” occurs in 50% of the training spam,
then the classifier assigns this a ‘spam’ probability of .75 * .50 =
37.5%. The problem with this calculation is that there are typically
many terms, and the probabilities are often small, so their product can
end up smaller than the smallest representable floating-point number and
underflow to zero. The way
around this is to take the sum of the log probabilities (so log(.75) +
log(.50)). The authors don’t do this, though, and it’s apparent that
they end up with underflow errors. See, for example, the code output on
page 89. This is also what leads to them having essentially the same
error rates for “hard” ham as they do for “easy” ham in the tables on
pages 89 and 92. Once you fix this problem, it turns out the classifier
is actually much better for spam and easy ham than it appears in the
book, but it’s way worse for hard ham.</p>
</li>
</ul>
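<p>The underflow problem is easy to reproduce. The sketch below is illustrative only: the helper names and the list of 100 probabilities of 1e-5 are made up, not taken from the book or the notebook. It compares the naive product with the sum of logs.</p>

```python
import math


def product(probs):
    # Naive approach: multiply the per-term probabilities together.
    result = 1.0
    for p in probs:
        result *= p
    return result


def log_score(probs):
    # Safer approach: add log probabilities instead of multiplying.
    return sum(math.log(p) for p in probs)


# The two-term "buy viagra" example: both approaches agree.
small = [0.75, 0.50]
print(product(small))               # 0.375
print(math.exp(log_score(small)))   # approximately 0.375

# Hypothetical long message: 100 terms, each with probability 1e-5.
many = [1e-5] * 100
print(product(many))     # 0.0, the true value 1e-500 underflows
print(log_score(many))   # a finite negative number, roughly -1151.3
```

<p>Because the log is monotonic, comparing log scores across classes picks the same winner as comparing the products would, without ever forming the tiny product.</p>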
<p>I’m going to focus on the second version of the program, though, in the
notebook called <code>ch3_nltk.ipynb</code>. You can view it online <a href="http://nbviewer.ipython.org/urls/raw.github.com/carljv/Will_it_Python/master/MLFH/CH3/ch3_nltk.ipynb">here</a>. In
this version, I use <span class="caps">NLTK</span>’s built-in <code>NaiveBayesClassifier</code> function, and
avoid creating the <span class="caps">TDM</span> (which isn’t really used for much in the original
code anyway).</p>
<h2>Building a Naive Bayes spam classifier with <span class="caps">NLTK</span></h2>
<p>I’ll follow the same logic as the program from chapter 3, but I’ll do so
with a workflow more suited to <span class="caps">NLTK</span>’s functions. So instead of creating
a term-document matrix, and building my own Naive Bayes classifier, I’ll
build a <code>features → label</code> association for each training e-mail, and
feed a list of these to <span class="caps">NLTK</span>’s <code>NaiveBayesClassifier</code> function.</p>
<h3>Extracting word features from the e-mail messages</h3>
<p>The program begins with some simple code that loads the e-mail files
from the directories, extracts the “message” or body of the e-mail, and
loads all those messages into a list. This follows the book’s code
pretty closely, and we end up with training and testing lists of spam,
easy ham, and hard ham. The training data will be the e-mails in the
training directories for spam and easy ham. (So, like in the book, we’re
not training on any hard ham.)</p>
<p>Each e-mail in our classifier’s training data will have a label (“spam”
or “ham”) and a feature set. For this application, we’re going to
use a feature set that is simply the set of unique words in the e-mail.
Below, I’ll turn this into a dictionary to feed into the
<code>NaiveBayesClassifier</code>, but first, let’s get the set.</p>
<blockquote>
<p><strong>Note:</strong> This is similar to a “bag-of-words” model, in that it
doesn’t care about word order or other semantic information. But a
“bag-of-words” usually considers the frequency of the word within the
document (like a histogram of the words), whereas we’re only concerned
with whether it’s in an e-mail, not how often it occurs.</p>
</blockquote>
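<p>The difference between the two representations is small in code. A minimal sketch, using a made-up token list:</p>

```python
from collections import Counter

# Hypothetical tokens from one e-mail.
tokens = "free money free offer free".split()

word_set = set(tokens)   # set-of-words: presence only
bag = Counter(tokens)    # bag-of-words: presence plus frequency

print(sorted(word_set))  # ['free', 'money', 'offer']
print(bag["free"])       # 3
```

<p>The set throws away the fact that “free” appears three times; the classifier built below only ever sees the set.</p>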
<h3>Parsing and tokenizing the e-mails</h3>
<p>I’m going to use <span class="caps">NLTK</span>’s <code>wordpunct_tokenize</code> function to break the
message into tokens. This splits tokens at white space and (most)
punctuation marks, and returns the punctuation along with the tokens on
each side. So <code>"I don't know. Do you?"</code> becomes
<code>["I", "don", "'", "t", "know", ".", "Do", "you", "?"]</code>.</p>
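<p>To see the splitting rule without installing anything, the behaviour can be mimicked with the standard library. As far as I know, <code>wordpunct_tokenize</code> is a regex tokenizer built on the pattern <code>\w+|[^\w\s]+</code>; the <code>wordpunct</code> helper below is my own stand-in for it, not <span class="caps">NLTK</span>’s function.</p>

```python
import re

# Runs of word characters, or runs of characters that are neither
# word characters nor whitespace (i.e. punctuation).
WORDPUNCT = re.compile(r"\w+|[^\w\s]+")


def wordpunct(text):
    # Stand-in for nltk.tokenize.wordpunct_tokenize.
    return WORDPUNCT.findall(text)


print(wordpunct("I don't know. Do you?"))
# ['I', 'don', "'", 't', 'know', '.', 'Do', 'you', '?']
```

<p>Note that adjacent punctuation characters are grouped into a single token, which matters once the messages contain markup.</p>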
<p>If you look through some of the training e-mails in
<code>train_spam_messages</code> and <code>train_ham_messages</code>, you’ll notice a few
features that make extracting words tricky.</p>
<p>First, there are a couple of odd text artefacts. The string ‘3D’ shows
up in strange places, in <span class="caps">HTML</span> attributes and elsewhere, and we’ll
remove these. Furthermore, there seem to be some mid-word line wraps
flagged with an ‘=’ where the word is broken across lines. For example,
the word ‘apple’ might be split across lines like ‘app=\nle’. We want
to strip these out so we can recover ‘apple’. We’ll want to deal with
all these first, before we apply the tokenizer.</p>
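<p>Both artefacts look like leftovers of quoted-printable encoding, where <code>=3D</code> encodes a literal <code>=</code> and a trailing <code>=</code> marks a soft line break. That is my reading of the symptoms, not something I have verified against the corpus. If it holds, the standard library’s <code>quopri</code> module undoes both in one step; the sample string below is made up.</p>

```python
import quopri

# Hypothetical fragment showing both artefacts: '=3D' and a soft line break.
raw = b'<font color=3D"red">app=\nle</font>'

decoded = quopri.decodestring(raw).decode("ascii")
print(decoded)  # <font color="red">apple</font>
```

<p>This is an alternative to the simple string replacements used in the program below, which remove every ‘3D’ whether or not it follows an ‘=’.</p>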
<p>Second, there’s a lot of <span class="caps">HTML</span> in the messages. We’ll have to decide
first whether we want to keep <span class="caps">HTML</span> info in our set of words. If we do,
and we apply <code>wordpunct_tokenize</code> to some <span class="caps">HTML</span>, for example:</p>
<div class="highlight"><pre>"<span class="nt">&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;</span><span class="c">&lt;!-- Comment --&gt;</span>"
</pre></div>
<p>would tokenize to:</p>
<div class="highlight"><pre>["<span class="err">&lt;</span>", "HEAD", "&gt;<span class="err">&lt;</span>/", "HEAD", "&gt;<span class="err">&lt;</span>", "BODY", "&gt;<span class="c">&lt;!--", "Comment", "--&gt;</span>"]
</pre></div>
<p>So if we drop the punctuation tokens, and get the unique set of what
remains, we’d have <code>{"HEAD", "BODY", "Comment"}</code>, which seems like what
we’d want. For example, it’s nice that this method doesn’t make
<code>&lt;HEAD&gt;</code> and <code>&lt;/HEAD&gt;</code> separate words in our set, but just captures the
existence of this tag with the term <code>"HEAD"</code>. It might be a problem that
we won’t distinguish between the <span class="caps">HTML</span> tag <code>&lt;HEAD&gt;</code> and “head” used as an
English word in the message. But for the moment I’m willing to bet that
sort of conflation won’t have a big effect on the classifier.</p>
<p>If we don’t want to count <span class="caps">HTML</span> information in our set of words, we can
set <code>strip_html</code> to <code>True</code>, and we’ll take all the <span class="caps">HTML</span> tags out before tokenizing.</p>
<p>Lastly we’ll strip out any “stopwords” from the set. Stopwords are
highly common, and therefore low-information, words like “a”, “the”, “he”,
etc. Below I’ll use <code>stopwords</code>, downloaded from <span class="caps">NLTK</span>’s corpus library,
with minor modifications to deal with this. (In other programs I’ve
used the stopwords exported from R’s <code>tm</code> package.)</p>
<p>Note that because our tokenizer splits contractions (“she’ll” → “she”,
“ll”), we’d like to drop the ends (“ll”). Some of these may be picked up
in <span class="caps">NLTK</span>’s <code>stopwords</code> list, others we’ll manually add. It’s an
imperfect, but easy solution. There are more sophisticated ways of
dealing with this which are overkill for our purposes.</p>
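<p>Here is a small sketch of the idea of padding a stopword list with contraction fragments. The sets are tiny and made up for illustration; the real program uses <span class="caps">NLTK</span>’s much longer list.</p>

```python
# Hypothetical, tiny stopword list.
base_stopwords = {"a", "the", "he", "she", "i", "do"}

# Fragments the tokenizer leaves behind when it splits contractions
# such as "she'll", "don't", "I've", "you're".
contraction_parts = {"ll", "don", "ve", "re"}

stopword_set = base_stopwords | contraction_parts

# Unique tokens from the message "She'll win the lottery", lowercased.
tokens = {"she", "ll", "win", "the", "lottery"}
print(sorted(tokens - stopword_set))  # ['lottery', 'win']
```

<p>Single-letter fragments like the “t” in “don’t” don’t need to be listed, because the word extractor below drops one-character tokens anyway.</p>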
<p>Tokenizing, as perhaps you can tell, is a non-trivial operation. <span class="caps">NLTK</span>
has a host of other tokenizing functions of varying sophistication, and
even lets you define your own tokenizing rule using regex.</p>
<div class="highlight"><pre><span class="kn">import</span> <span class="nn">re</span>
<span class="kn">from</span> <span class="nn">nltk.tokenize</span> <span class="kn">import</span> <span class="n">wordpunct_tokenize</span>

<span class="k">def</span> <span class="nf">get_msg_words</span><span class="p">(</span><span class="n">msg</span><span class="p">,</span> <span class="n">stopwords</span> <span class="o">=</span> <span class="p">[],</span> <span class="n">strip_html</span> <span class="o">=</span> <span class="bp">False</span><span class="p">):</span>
    <span class="sd">'''</span>
<span class="sd">    Returns a list of the unique words contained in an e-mail message.</span>
<span class="sd">    Excludes any that are in an optionally-provided list.</span>
<span class="sd">    NLTK's 'wordpunct' tokenizer is used, and this will break contractions.</span>
<span class="sd">    For example, don't -&gt; (don, ', t). Therefore, it's advisable to supply</span>
<span class="sd">    a stopwords list that includes contraction parts, like 'don' and 't'.</span>
<span class="sd">    '''</span>
    <span class="c"># Strip out weird '3D' artefacts.</span>
    <span class="n">msg</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">sub</span><span class="p">(</span><span class="s">'3D'</span><span class="p">,</span> <span class="s">''</span><span class="p">,</span> <span class="n">msg</span><span class="p">)</span>
    <span class="c"># Strip out html tags and attributes and html character codes,</span>
    <span class="c"># like '&amp;nbsp;' and '&amp;lt;'.</span>
    <span class="k">if</span> <span class="n">strip_html</span><span class="p">:</span>
        <span class="n">msg</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">sub</span><span class="p">(</span><span class="s">'&lt;(.|</span><span class="se">\n</span><span class="s">)*?&gt;'</span><span class="p">,</span> <span class="s">' '</span><span class="p">,</span> <span class="n">msg</span><span class="p">)</span>
        <span class="n">msg</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">sub</span><span class="p">(</span><span class="s">r'&amp;\w+;'</span><span class="p">,</span> <span class="s">' '</span><span class="p">,</span> <span class="n">msg</span><span class="p">)</span>
    <span class="c"># wordpunct_tokenize doesn't split on underscores. We don't</span>
    <span class="c"># want to strip them, since the token first_name may be more</span>
    <span class="c"># informative than 'first' and 'name' apart. But there are tokens</span>
    <span class="c"># with long underscore strings (e.g. 'name_______'). We'll just</span>
    <span class="c"># replace the multiple underscores with a single one, since</span>
    <span class="c"># 'name_____' is probably not distinct from 'name___' or 'name_'</span>
    <span class="c"># in identifying spam.</span>
    <span class="n">msg</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">sub</span><span class="p">(</span><span class="s">'_+'</span><span class="p">,</span> <span class="s">'_'</span><span class="p">,</span> <span class="n">msg</span><span class="p">)</span>
    <span class="c"># Note, remove '=' followed by a newline before tokenizing, since</span>
    <span class="c"># these occur within words to indicate line-wrapping.</span>
    <span class="n">msg_words</span> <span class="o">=</span> <span class="nb">set</span><span class="p">(</span><span class="n">wordpunct_tokenize</span><span class="p">(</span><span class="n">msg</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s">'=</span><span class="se">\n</span><span class="s">'</span><span class="p">,</span> <span class="s">''</span><span class="p">)</span><span class="o">.</span><span class="n">lower</span><span class="p">()))</span>
    <span class="c"># Get rid of stopwords</span>
    <span class="n">msg_words</span> <span class="o">=</span> <span class="n">msg_words</span><span class="o">.</span><span class="n">difference</span><span class="p">(</span><span class="n">stopwords</span><span class="p">)</span>
    <span class="c"># Get rid of punctuation tokens, numbers, and single letters.</span>
    <span class="n">msg_words</span> <span class="o">=</span> <span class="p">[</span><span class="n">w</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">msg_words</span> <span class="k">if</span> <span class="n">re</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="s">'[a-zA-Z]'</span><span class="p">,</span> <span class="n">w</span><span class="p">)</span> <span class="ow">and</span>
                 <span class="nb">len</span><span class="p">(</span><span class="n">w</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">1</span><span class="p">]</span>
    <span class="k">return</span> <span class="n">msg_words</span>
</pre></div>
<h3>Making a <code>(features, label)</code> list</h3>
<p>The <code>NaiveBayesClassifier</code> function trains on data that’s of the form
<code>[(features1, label1), (features2, label2), ..., (featuresN, labelN)]</code>
where <code>featuresi</code> is a dictionary of features for e-mail <code>i</code> and
<code>labeli</code> is the label for e-mail <code>i</code> (<code>spam</code> or <code>ham</code>).</p>
<p>The function <code>features_from_messages</code> iterates through the messages
creating this list, but calls an outside function to create the features
for each e-mail. This makes the function modular in case we decide to
try out some other method of extracting features from the e-mails
besides the set of words. It then pairs the features with the e-mail’s
label in a tuple and adds the tuple to the list.</p>
<p>The <code>word_indicator</code> function calls <code>get_msg_words()</code> to get an e-mail’s
words as a set, then creates a dictionary with entries <code>{word: True}</code>
for each word in the set. This is a little counter-intuitive (since we
don’t have <code>{word: False}</code> entries for words not in the set) but
<code>NaiveBayesClassifier</code> knows how to handle it.</p>
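<p>To make the data shape concrete, here is a minimal sketch of a two-message training list built by hand. The <code>word_features</code> helper and the word sets are invented for illustration, and the <code>.get(..., False)</code> lookups only show how absence can be read as “not present”; they are not a claim about how <span class="caps">NLTK</span> treats missing features internally.</p>

```python
def word_features(words):
    # Map every word in the set to True; absent words simply have no entry.
    return dict.fromkeys(words, True)


# Hypothetical training data in the [(features, label), ...] shape.
training = [
    (word_features({"buy", "viagra"}), "spam"),
    (word_features({"meeting", "notes"}), "ham"),
]

features, label = training[0]
print(label)                            # spam
print(features.get("viagra", False))    # True
print(features.get("meeting", False))   # False
```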
<div class="highlight"><pre><span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">defaultdict</span>

<span class="k">def</span> <span class="nf">features_from_messages</span><span class="p">(</span><span class="n">messages</span><span class="p">,</span> <span class="n">label</span><span class="p">,</span> <span class="n">feature_extractor</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
    <span class="sd">'''</span>
<span class="sd">    Make a (features, label) tuple for each message in a list of e-mails</span>
<span class="sd">    that share a label ('spam' or 'ham'), and return a list of these tuples.</span>
<span class="sd">    Note every e-mail in 'messages' should have the same label.</span>
<span class="sd">    '''</span>
    <span class="n">features_labels</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">msg</span> <span class="ow">in</span> <span class="n">messages</span><span class="p">:</span>
        <span class="n">features</span> <span class="o">=</span> <span class="n">feature_extractor</span><span class="p">(</span><span class="n">msg</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
        <span class="n">features_labels</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">features</span><span class="p">,</span> <span class="n">label</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">features_labels</span>

<span class="k">def</span> <span class="nf">word_indicator</span><span class="p">(</span><span class="n">msg</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
    <span class="sd">'''</span>
<span class="sd">    Create a dictionary of entries {word: True} for every unique</span>
<span class="sd">    word in a message.</span>
<span class="sd">    Note **kwargs are options to the word-set creator,</span>
<span class="sd">    get_msg_words().</span>
<span class="sd">    '''</span>
    <span class="n">features</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">list</span><span class="p">)</span>
    <span class="n">msg_words</span> <span class="o">=</span> <span class="n">get_msg_words</span><span class="p">(</span><span class="n">msg</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">msg_words</span><span class="p">:</span>
        <span class="n">features</span><span class="p">[</span><span class="n">w</span><span class="p">]</span> <span class="o">=</span> <span class="bp">True</span>
    <span class="k">return</span> <span class="n">features</span>
</pre></div>
<h2>Training and evaluating the classifier</h2>
<p>With those functions defined, we can apply them to the training and
testing spam and ham messages.</p>
<div class="highlight"><pre><span class="k">def</span> <span class="nf">make_train_test_sets</span><span class="p">(</span><span class="n">feature_extractor</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
    <span class="sd">'''</span>
<span class="sd">    Make (feature, label) lists for each of the training</span>
<span class="sd">    and testing lists.</span>
<span class="sd">    '''</span>
    <span class="n">train_spam</span> <span class="o">=</span> <span class="n">features_from_messages</span><span class="p">(</span><span class="n">train_spam_messages</span><span class="p">,</span> <span class="s">'spam'</span><span class="p">,</span>
                                        <span class="n">feature_extractor</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
    <span class="n">train_ham</span> <span class="o">=</span> <span class="n">features_from_messages</span><span class="p">(</span><span class="n">train_easyham_messages</span><span class="p">,</span> <span class="s">'ham'</span><span class="p">,</span>
                                       <span class="n">feature_extractor</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
    <span class="n">train_set</span> <span class="o">=</span> <span class="n">train_spam</span> <span class="o">+</span> <span class="n">train_ham</span>
    <span class="n">test_spam</span> <span class="o">=</span> <span class="n">features_from_messages</span><span class="p">(</span><span class="n">test_spam_messages</span><span class="p">,</span> <span class="s">'spam'</span><span class="p">,</span>
                                       <span class="n">feature_extractor</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
    <span class="n">test_ham</span> <span class="o">=</span> <span class="n">features_from_messages</span><span class="p">(</span><span class="n">test_easyham_messages</span><span class="p">,</span> <span class="s">'ham'</span><span class="p">,</span>
                                      <span class="n">feature_extractor</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
    <span class="n">test_hardham</span> <span class="o">=</span> <span class="n">features_from_messages</span><span class="p">(</span><span class="n">test_hardham_messages</span><span class="p">,</span> <span class="s">'ham'</span><span class="p">,</span>
                                          <span class="n">feature_extractor</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">train_set</span><span class="p">,</span> <span class="n">test_spam</span><span class="p">,</span> <span class="n">test_ham</span><span class="p">,</span> <span class="n">test_hardham</span>
</pre></div>
<p>Notice that the training set we’ll use to train the classifier combines
both the spam and easy ham training sets (since we need both types of
e-mail to train it).</p>
<p>Finally, let’s write a function to train the classifier and check how
accurate it is on the test data.</p>
<div class="highlight"><pre><span class="kn">import</span> <span class="nn">nltk</span>
<span class="kn">from</span> <span class="nn">nltk.classify</span> <span class="kn">import</span> <span class="n">NaiveBayesClassifier</span>

<span class="k">def</span> <span class="nf">check_classifier</span><span class="p">(</span><span class="n">feature_extractor</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
    <span class="sd">'''</span>
<span class="sd">    Train the classifier on the training spam and ham, then check its</span>
<span class="sd">    accuracy on the test data, and show the classifier's most</span>
<span class="sd">    informative features.</span>
<span class="sd">    '''</span>
    <span class="c"># Make training and testing sets of (features, label) data</span>
    <span class="n">train_set</span><span class="p">,</span> <span class="n">test_spam</span><span class="p">,</span> <span class="n">test_ham</span><span class="p">,</span> <span class="n">test_hardham</span> <span class="o">=</span> \
        <span class="n">make_train_test_sets</span><span class="p">(</span><span class="n">feature_extractor</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
    <span class="c"># Train the classifier on the training set</span>
    <span class="n">classifier</span> <span class="o">=</span> <span class="n">NaiveBayesClassifier</span><span class="o">.</span><span class="n">train</span><span class="p">(</span><span class="n">train_set</span><span class="p">)</span>
    <span class="c"># How accurate is the classifier on the test sets?</span>
    <span class="k">print</span> <span class="p">(</span><span class="s">'Test Spam accuracy: {0:.2f}%'</span>
           <span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="mi">100</span> <span class="o">*</span> <span class="n">nltk</span><span class="o">.</span><span class="n">classify</span><span class="o">.</span><span class="n">accuracy</span><span class="p">(</span><span class="n">classifier</span><span class="p">,</span> <span class="n">test_spam</span><span class="p">)))</span>
    <span class="k">print</span> <span class="p">(</span><span class="s">'Test Ham accuracy: {0:.2f}%'</span>
           <span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="mi">100</span> <span class="o">*</span> <span class="n">nltk</span><span class="o">.</span><span class="n">classify</span><span class="o">.</span><span class="n">accuracy</span><span class="p">(</span><span class="n">classifier</span><span class="p">,</span> <span class="n">test_ham</span><span class="p">)))</span>
    <span class="k">print</span> <span class="p">(</span><span class="s">'Test Hard Ham accuracy: {0:.2f}%'</span>
           <span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="mi">100</span> <span class="o">*</span> <span class="n">nltk</span><span class="o">.</span><span class="n">classify</span><span class="o">.</span><span class="n">accuracy</span><span class="p">(</span><span class="n">classifier</span><span class="p">,</span> <span class="n">test_hardham</span><span class="p">)))</span>
    <span class="c"># Show the top 20 informative features (the method prints them itself)</span>
    <span class="n">classifier</span><span class="o">.</span><span class="n">show_most_informative_features</span><span class="p">(</span><span class="mi">20</span><span class="p">)</span>
</pre></div>
<p>The function also prints out the results of <code>NaiveBayesClassifier</code>’s
handy <code>show_most_informative_features</code> method. This shows which features
are most unique to one label or another. For example, if “viagra” shows
up in 500 of the spam e-mails, but only 2 of the “ham” e-mails in the
training set, then the method will show that “viagra” is one of the most
informative features with a <code>spam:ham</code> ratio of 250:1.</p>
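<p>The arithmetic behind that ratio can be sketched as a ratio of per-class frequencies. This is a simplified illustration, not <span class="caps">NLTK</span>’s implementation: the <code>likelihood_ratio</code> helper and the corpus sizes are made up, and I believe <span class="caps">NLTK</span> smooths its probability estimates, so its reported ratios will differ somewhat from raw counts.</p>

```python
def likelihood_ratio(count_a, total_a, count_b, total_b):
    # Ratio of the share of class-A e-mails containing the word
    # to the share of class-B e-mails containing it.
    share_a = float(count_a) / total_a
    share_b = float(count_b) / total_b
    return share_a / share_b


# Hypothetical corpus: 1000 spam and 1000 ham training e-mails,
# with "viagra" in 500 of the spam and 2 of the ham.
equal = likelihood_ratio(500, 1000, 2, 1000)
print(round(equal, 1))    # 250.0

# Same counts, but with 4000 ham e-mails the word is rarer in ham.
unequal = likelihood_ratio(500, 1000, 2, 4000)
print(round(unequal, 1))  # 1000.0
```

<p>The 250:1 figure is the raw count ratio only when both classes have the same number of training e-mails; otherwise it is the shares, not the counts, that are compared.</p>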
<p>So how do we do? I’ll check two versions. The first uses the <span class="caps">HTML</span> info
in the e-mails in the classifier:</p>
<div class="highlight"><pre><span class="n">check_classifier</span><span class="p">(</span><span class="n">word_indicator</span><span class="p">,</span> <span class="n">stopwords</span> <span class="o">=</span> <span class="n">sw</span><span class="p">)</span>
</pre></div>
<p>Which gives:</p>
<div class="highlight"><pre>Test Spam accuracy: 98.71%
Test Ham accuracy: 97.07%
Test Hard Ham accuracy: 13.71%
Most Informative Features
align = True spam : ham = 119.7 : 1.0
tr = True spam : ham = 115.7 : 1.0
td = True spam : ham = 111.7 : 1.0
arial = True spam : ham = 107.7 : 1.0
cellpadding = True spam : ham = 97.0 : 1.0
cellspacing = True spam : ham = 94.3 : 1.0
img = True spam : ham = 80.3 : 1.0
bgcolor = True spam : ham = 67.4 : 1.0
href = True spam : ham = 67.0 : 1.0
sans = True spam : ham = 62.3 : 1.0
colspan = True spam : ham = 61.0 : 1.0
font = True spam : ham = 61.0 : 1.0
valign = True spam : ham = 60.3 : 1.0
br = True spam : ham = 59.6 : 1.0
verdana = True spam : ham = 57.7 : 1.0
nbsp = True spam : ham = 57.4 : 1.0
color = True spam : ham = 54.4 : 1.0
ff0000 = True spam : ham = 53.0 : 1.0
ffffff = True spam : ham = 50.6 : 1.0
border = True spam : ham = 49.6 : 1.0
</pre></div>
<p>The classifier does a really good job for spam and easy ham, but it’s
pretty miserable for hard ham. This may be because hard ham messages
tend to be <span class="caps">HTML</span>-formatted while easy ham messages aren’t. Note how much
the classifier relies on <span class="caps">HTML</span> information—nearly all the most
informative features are <span class="caps">HTML</span>-related.</p>
<p>If we try just using the text of the messages, without the <span class="caps">HTML</span>
information, we lose a tiny bit of accuracy in identifying spam but do
much better with the hard ham.</p>
<div class="highlight"><pre><span class="n">check_classifier</span><span class="p">(</span><span class="n">word_indicator</span><span class="p">,</span> <span class="n">stopwords</span> <span class="o">=</span> <span class="n">sw</span><span class="p">,</span> <span class="n">strip_html</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
</pre></div>
<p>shows</p>
<div class="highlight"><pre>Test Spam accuracy: 96.64%
Test Ham accuracy: 98.64%
Test Hard Ham accuracy: 56.05%
Most Informative Features
dear = True spam : ham = 41.7 : 1.0
aug = True ham : spam = 38.3 : 1.0
guaranteed = True spam : ham = 35.0 : 1.0
assistance = True spam : ham = 29.7 : 1.0
groups = True ham : spam = 27.9 : 1.0
mailings = True spam : ham = 25.0 : 1.0
sincerely = True spam : ham = 23.0 : 1.0
fill = True spam : ham = 23.0 : 1.0
mortgage = True spam : ham = 21.7 : 1.0
sir = True spam : ham = 21.0 : 1.0
sponsor = True ham : spam = 20.3 : 1.0
article = True ham : spam = 20.3 : 1.0
assist = True spam : ham = 19.0 : 1.0
income = True spam : ham = 18.6 : 1.0
tue = True ham : spam = 18.3 : 1.0
mails = True spam : ham = 18.3 : 1.0
iso = True spam : ham = 17.7 : 1.0
admin = True ham : spam = 17.7 : 1.0
monday = True ham : spam = 17.7 : 1.0
earn = True spam : ham = 17.0 : 1.0
</pre></div>
<p>Check out the most informative features; they make a lot of sense. Note
that it’s mostly spammers who address you with “Dear” and “Sir” and sign off with
“Sincerely,”. (Probably those Nigerian princes; they tend to be polite.)
Other spam flags that gel with our intuition are “guaranteed”,
“mortgage”, “assist”, “assistance”, and “income.”</p>
<h2>Conclusion</h2>
<p>So we’ve built a simple but decent spam classifier with just a tiny
amount of code. <span class="caps">NLTK</span> provides a wealth of tools for doing this sort of
thing more seriously, including ways to extract more sophisticated
features and more complex classifiers.</p>
<h1>Better typography for IPython notebooks</h1>
<p>Carl · 2012-12-05T05:34:00-05:00 · tag:slendermeans.org,2012-12-05:better-typography-for-ipython-notebooks.html</p>
<p><em>(Warning: ignorant rant coming up)</em></p>
<p>Like everyone else who’s ever used it, I love the <a href="http://ipython.org/ipython-doc/rel-0.13.1/interactive/htmlnotebook.html">IPython
notebook.</a> It’s not only an awesomely productive environment to work
in, it’s also the most powerful weapon in the Python evangelist’s
arsenal (suck it, Matlab).</p>
<p>I also think it’s not hard to imagine a world where scientific papers
are all just literate programs. And the notebook is probably one of the
best tools for literate programming around in any language. The
integration of markdown and LaTeX/MathJax into the notebook is just fantastic.</p>
<p>But it does have one weakness as a literate programming tool. The
default typography is ugly as sin.</p>
<p>There are several issues, but two major ones are easily fixable.</p>
<h2>Long lines</h2>
<p>By far the biggest issue is that the text and input cells extend to 100%
of the window width. Most people keep their browser windows open wider
than is comfortable reading width, so you end up with long hard-to-read
lines of text in the markdown cells.</p>
<p>And for the code, it would be nice to have the code cell discourage you
from long lines. The variable width cells don’t. I’m an 80-character
anal retentive, and even I have trouble in the notebook getting a sense
of when a line is too long.</p>
<p>When you write a script in a text editor, there’s lots of previous code
in the viewable window, so your eye gets a sense of the ‘right-margin’
of the code. (Not to mention many editors will indicate the 80- or
whatever-character column, so you know exactly when to break). But in
the notebook, your code is typically broken up into smaller blocks, and
those blocks are interspersed with output and other cells. It’s hard to
get a visual sense of the right margin.</p>
<h2>Ugly fonts</h2>
<p>Text and markdown cells are typically rendered in Helvetica or Arial.
Helvetica is a fine font, obviously, but it’s not really suitable for
paragraphs of text (how many books, magazines, newspapers, or academic
papers do you see with body text typeset in Helvetica?). Combined
with the small size and long lines, it’s hard to read and just plain
ugly. I don’t think I have to say anything about Arial.</p>
<p>The way I use the notebook—with markdown cells used for long stretches
of explanatory text and result interpretation—it’s better to have the
text cells render in a serif font. This way it stands out from the code
and output cells more. Serif fonts also have more distinctive italics,
and integrate better with LaTeX/MathJax math.</p>
<p>Code cells and interpreter output cells render in whatever your default
monospace font is. That’s typically Courier or Courier New. This is
fine, but really, this is the 21st century—we can do <a href="http://blogs.adobe.com/typblography/2012/09/source-code-pro.html">a lot better</a>.</p>
<h2>Update: one more thing</h2>
<p>I realize I’ve made one other change that I think is important. The
default ordered list in the notebook uses roman numerals (I, <span class="caps">II</span>, <span class="caps">III</span>,
…). I almost always want arabic numerals (1, 2, 3, …) instead. We
can change this in the file <code>renderedhtml.css</code> with</p>
<div class="highlight"><pre><span class="nc">.rendered_html</span> <span class="nt">ol</span> <span class="p">{</span><span class="k">list-style</span><span class="o">:</span><span class="k">decimal</span><span class="p">;</span> <span class="k">margin</span><span class="o">:</span> <span class="m">1em</span> <span class="m">2em</span><span class="p">;}</span>
</pre></div>
<p>(Also check the comments for other, and typically better ways to make
changes.) You can also modify sub-levels <code>ol ol</code>, <code>ol ol ol</code>, etc.
Ideally I’d like to have nested numbers 1.1, 1.1.1, but this isn’t
straightforward so I haven’t implemented it. If anyone has tips, I’d be
thrilled to hear them.</p>
<h2>Fixing it (locally, at least)</h2>
<p><em>(Warning: I don’t know what I’m doing. Don’t make any of these changes,
or any others, without backing up the files first.)</em></p>
<blockquote>
<p>(<strong>Update</strong>: Matthias Bussonnier has an <a href="http://nbviewer.ipython.org/urls/raw.github.com/Carreau/posts/master/Blog1.ipynb">informative post</a>
showing the right way to make these changes. If you make the <span class="caps">CSS</span> changes
I describe below, do it the way he advises, not through the files I
describe here.)</p>
</blockquote>
<p>The notebook is served through the browser, so its frontend is basically
just <span class="caps">HTML</span>, Javascript, and <span class="caps">CSS</span>. The typography and appearance of the
notebook is nearly all driven by <span class="caps">CSS</span> files located where IPython is
stored on your system. This will differ based on your <span class="caps">OS</span> and your Python
distribution. On my mac, with the AnacondaCE distribution, the
stylesheets are located
in <code>/Users/cvogel/anaconda/lib/python2.7/site-packages/IPython/frontend/html/notebook/static</code>.
There are several subfolders there, including one called <code>/css</code> and
<code>/codemirror</code>. You can also take a look at the stylesheet files by
firing up a notebook, and using your browser’s inspector. If your
browser (e.g. Chrome) lets you edit stylesheets on the fly in the
inspector, you can try out changes relatively safely.</p>
<p>Here are the edits I’ve made on my system to address the issues above.
First, in the <code>/css</code> folder, in the file called <code>notebook.css</code>:</p>
<p>1. Set code input cells to be narrower (code that runs past the width
will be invisible). I try to set this for about 80 characters plus some
buffer. There’s no way to set width as a number of characters in <span class="caps">CSS</span>, so
you may have to experiment to see which ex-width works with your font.</p>
<div class="highlight"><pre><span class="nt">div</span><span class="nc">.input</span> <span class="p">{</span>
<span class="k">width</span><span class="o">:</span> <span class="m">105ex</span><span class="p">;</span> <span class="c">/* about 80 chars + buffer */</span>
<span class="o">...</span>
<span class="p">}</span>
</pre></div>
<p>2. Fixing markdown/text cells. I make changes to the font, width, and
linespacing. I’m using Charis <span class="caps">SIL</span>, a font based on the classic Bitstream
Charter, and freely available <a href="http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=CharisSILfont">here</a>. I also shorten the lines and add
some line space (120% to 150% of point size is usually a good range) for legibility.</p>
<div class="highlight"><pre><span class="nt">div</span><span class="nc">.text_cell</span> <span class="p">{</span>
<span class="k">width</span><span class="o">:</span> <span class="m">105ex</span><span class="p">;</span> <span class="c">/* instead of 100% */</span>
<span class="o">...</span>
<span class="p">}</span>
<span class="nt">div</span><span class="nc">.text_cell_render</span> <span class="p">{</span>
<span class="c">/*font-family: "Helvetica Neue", Arial, Helvetica, Geneva, sans-serif;*/</span>
<span class="k">font-family</span><span class="o">:</span> <span class="s2">"Charis SIL"</span><span class="o">,</span> <span class="k">serif</span><span class="p">;</span> <span class="c">/* Make non-code text serif. */</span>
<span class="k">line-height</span><span class="o">:</span> <span class="m">145%</span><span class="p">;</span> <span class="c">/* added for some line spacing of text. */</span>
<span class="k">width</span><span class="o">:</span> <span class="m">105ex</span><span class="p">;</span> <span class="c">/* instead of 'inherit' for shorter lines */</span>
<span class="o">...</span>
<span class="p">}</span>
</pre></div>
<p>3. Add styles to specify sizes for headers.</p>
<div class="highlight"><pre><span class="c">/* Set the size of the headers */</span>
<span class="nt">div</span><span class="nc">.text_cell_render</span> <span class="nt">h1</span> <span class="p">{</span>
<span class="k">font-size</span><span class="o">:</span> <span class="m">18pt</span><span class="p">;</span>
<span class="p">}</span>
<span class="nt">div</span><span class="nc">.text_cell_render</span> <span class="nt">h2</span> <span class="p">{</span>
<span class="k">font-size</span><span class="o">:</span> <span class="m">14pt</span><span class="p">;</span>
<span class="p">}</span>
</pre></div>
<p>Then, in the <code>/codemirror/lib</code> subfolder, there’s a file called
codemirror.css. In here we can change the font used for code, both input
and interpreter output. I’m using Consolas.</p>
<div class="highlight"><pre><span class="nc">.CodeMirror</span> <span class="p">{</span>
<span class="k">font-family</span><span class="o">:</span> <span class="n">Consolas</span><span class="o">,</span> <span class="k">monospace</span><span class="p">;</span>
<span class="p">}</span>
</pre></div>
<p>Obviously these changes only affect notebooks you view on your local
machine. Anyone who views your notebooks on their own machine, or on
<a href="http://nbviewer.ipython.org">nbviewer</a>, will see the default style.</p>
<p>Here are before and after shots of these changes:</p>
<p><a href="../images/ipynb_unstyled.png">
<img src="../images/ipynb_unstyled.png" width = 500px />
</a>
<a href="../images/ipynb_styled.png">
<img src="../images/ipynb_styled.png" width = 500px /></a>
</p>
<h2>Fixing it (globally?)</h2>
<p>So this is all cute right? And it’s nice that we can do some
customizations to the notebook, but, you know, big deal.</p>
<p>I’d argue this is actually more important than just aesthetic tinkering.
The IPython notebook is becoming a one-stop-shop for exploration,
collaboration, publication, distribution, and replication in data
analysis. Like I said above, I think it’s not unreasonable that
notebooks could replace a large class of scientific papers. But to do
that, it has to perform as well as all the fragmented tools that
researchers are currently using. Otherwise, people are going to keep
pasting their code and results into Word and Latex documents. In other
words, the notebook has to work not just as an interactive environment,
but also as a static document. The IPython team realizes this, which is
why tools like <a href="https://github.com/ipython/nbconvert">nbconvert</a> exist.</p>
<p>People are doing amazing things in the notebook. The typography should
encourage people to read them, and not just serve as souped-up comments.</p>
<p>Tools are often strongly associated with aesthetic characteristics that
are only peripheral to the tool itself. ggplot can make charts that look
however you want, but when people think of ggplot, they think of the
gray background and the Color Brewer palette. And while the main selling
point of ggplot is its abstraction of the graph-making process, I think
it was the distinctive and attractive style of its graphs that made it
catch on so successfully. On the opposite end of the spectrum, when
people think of Stata graphics, they think of <a href="http://www.survey-design.com.au/distrib2.png">this</a>, and wince. And
Latex will typeset documents with whatever crazy font you want, but in
everyone’s mind, Latex ⇔ Computer Modern (for better or worse).
Design defaults are important: they’re marketing and they encourage good
habits by your users. It’d be a shame if people thought of
the IPython notebook and pictured long lines of small, single-spaced
Helvetica Neue.</p>
<p>It’s an insanely powerful tool. It’d be awesome if it were beautiful
too, and that goal seems eminently do-able.</p>How do you speed up 40,000 weighted least squares calculations? Skip 36,000 of them.2012-05-14T23:46:00-04:00Carltag:slendermeans.org,2012-05-14:lowess-speed.html<p>Despite having finished all the programming for Chapter 2 of <span class="caps">MLFH</span> a
while ago, there’s been a long hiatus since the <a href="../ml4h-ch2-p1.html">first post on that
chapter</a>.</p>
<h2>(S)lowess</h2>
<p>Why the delay? The second part of the code focuses on two procedures:
lowess scatterplot smoothing, and logistic regression. When implementing
the former in <a href="http://statsmodels.sourceforge.net/devel/generated/statsmodels.nonparametric.api.lowess.html#statsmodels.nonparametric.api.lowess">statsmodels</a>, I found that it was running <em>dog slow</em> on
the data—in this case a scatterplot of 10,000 height-vs.-weight points.
Indeed, for these 10,000 points, lowess, run with the default
parameters, required about 23 seconds. After importing modules and
defining variables according to my <a href="https://github.com/carljv/Will_it_Python/tree/master/MLFH/CH2">IPython notebook</a>, we can run
<code>timeit</code> on the function:</p>
<div class="highlight"><pre><span class="o">%</span><span class="n">timeit</span> <span class="o">-</span><span class="n">n</span> <span class="mi">3</span> <span class="n">lowess</span><span class="o">.</span><span class="n">lowess</span><span class="p">(</span><span class="n">heights</span><span class="p">,</span> <span class="n">weights</span><span class="p">)</span>
</pre></div>
<p>This results in</p>
<div class="highlight"><pre>3 loops, best of 3: 42.6 s per loop
</pre></div>
<p>on the machine I’m writing this on (a Windows laptop with a 2.67 GHz i5
processor; timings are faster, but still in the 30 sec. range on my 2.5
GHz i7 Macbook).</p>
<p>An R user—or really a user of any other statistical package—is going
to be confused here. We’re all used to lowess being a relatively
instantaneous procedure. It’s an oft-used option for graphics packages
like Lattice and ggplot2 — and it doesn’t take 20-30 seconds to
generate a plot with a lowess curve superimposed. So what’s the deal? Is
something wrong with the statsmodels implementation?</p>
<h2>The naive lowess algorithm</h2>
<p>Short answer: no. Long answer: yeah, kinda. Let’s start by looking at
the lowess algorithm in general, sticking to the 2-D y-vs.-x scatterplot
case. (I don’t really find multi-dimensional lowess useful anyway; maybe
others put it to frequent use. If so, I’d like to hear about it).</p>
<p>Let’s say we have data <em>{x<sub>1</sub>, …, x<sub>n</sub>}</em> and <em>{y<sub>1</sub>, …, y<sub>n</sub>}</em>. The
idea is to fit a set of values <em>{y<sup>*</sup><sub>1</sub>, …, y<sup>*</sup><sub>n</sub>}</em> where each is the
prediction at <em>x<sub>i</sub></em> from a weighted regression using a fixed
neighborhood of points around <em>x<sub>i</sub></em>. The weighting scheme puts less
weight on points that are far from <em>x<sub>i</sub></em>. The regression can be linear,
or polynomial, but linear is typical, and lowess procedures that use
polynomials of degree higher than 2 are rare.</p>
<p>After we get this first set of fits, we usually run the regressions a
few more times, each time modifying the weights to take into account
residuals from the previous fit. These “robustifying” iterations apply
successively less weight to outlying points in the data, reducing their
influence on the final curve.</p>
<p>Here’s the recipe:</p>
<ol>
<li>Select the number of neighbors, <em>k</em>, to use in each local
regression, and the number of robustifying iterations.</li>
<li>Sort the data, both <em>x</em> and <em>y</em>, by the order of the <em>x</em>-values.</li>
<li>For each <em>x<sub>i</sub></em> in <em>{x<sub>1</sub>, … x<sub>n</sub>}</em>:<ol>
<li>Find the <em>k</em> points nearest to <em>x<sub>i</sub></em> (the <em>neighborhood</em>).</li>
<li>Calculate the weights for each <em>x<sub>j</sub></em> in the neighborhood. This
requires:<ol>
<li>Calculating the distance between each <em>x<sub>j</sub></em> and <em>x<sub>i</sub></em> and
applying a weighting function to these distances.</li>
<li>Take the weights calculated from the previous fit’s
residuals (if this is not the first fit) and multiply them
by the distance weights.</li>
</ol>
</li>
<li>Run a regression of the <em>y<sub>j</sub></em>s on the <em>x<sub>j</sub></em>s in the
neighborhood, using the weights calculated in the previous step.
Predict <em>y<sup>*</sup><sub>i</sub></em>.</li>
</ol>
</li>
<li>Calculate the residuals from this fitted series of <em>{y<sup>*</sup><sub>1</sub>, …,
y<sup>*</sup><sub>n</sub>}</em>, and compute a weight from each of them.</li>
<li>Repeat 3 and 4 for the specified number of robustifying iterations.</li>
</ol>
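<p>The recipe above can be sketched in plain Python. This is a minimal
illustration, not the statsmodels or Cleveland code: the function names
(<code>naive_lowess</code>, <code>tricube</code>, <code>bisquare</code>, <code>weighted_line_fit</code>) are
made up for this sketch, and it assumes tricube distance weights and
bisquare robustness weights scaled by six times the median absolute
residual, which the recipe itself doesn’t spell out. Step numbers in the
comments refer to the list above.</p>

```python
import statistics


def tricube(u):
    """Distance weight: (1 - |u|^3)^3 inside the neighborhood, else 0."""
    u = abs(u)
    return 0.0 if u >= 1.0 else (1.0 - u ** 3) ** 3


def bisquare(residual, scale):
    """Robustness weight: (1 - (r/scale)^2)^2, dropping to 0 for big residuals."""
    if scale == 0.0:
        return 1.0  # perfect fit so far: nothing to down-weight
    u = abs(residual) / scale
    return 0.0 if u >= 1.0 else (1.0 - u ** 2) ** 2


def weighted_line_fit(xs, ys, ws, x0):
    """Fit a weighted least-squares line to (xs, ys); return its value at x0."""
    total = sum(ws)
    if total == 0.0:
        return sum(ys) / len(ys)  # all weights vanished: fall back to the mean
    xbar = sum(w * x for w, x in zip(ws, xs)) / total
    ybar = sum(w * y for w, y in zip(ws, ys)) / total
    sxx = sum(w * (x - xbar) ** 2 for w, x in zip(ws, xs))
    if sxx == 0.0:
        return ybar  # no spread in x: a slope can't be estimated
    sxy = sum(w * (x - xbar) * (y - ybar) for w, x, y in zip(ws, xs, ys))
    return ybar + (sxy / sxx) * (x0 - xbar)


def naive_lowess(x, y, k, iterations=3):
    """Return (sorted x, fitted y), running one local regression per point."""
    order = sorted(range(len(x)), key=lambda i: x[i])  # step 2
    x = [x[i] for i in order]
    y = [y[i] for i in order]
    n = len(x)
    robust = [1.0] * n  # residual-based weights; all 1 on the first fit
    fitted = [0.0] * n
    for _ in range(iterations + 1):  # step 5: initial fit + robustifying passes
        for i in range(n):  # step 3
            # 3.1: the k nearest points (brute force, to keep the sketch short)
            nbrs = sorted(range(n), key=lambda j: abs(x[j] - x[i]))[:k]
            radius = max(abs(x[j] - x[i]) for j in nbrs)
            # 3.2: distance weights times the previous fit's robustness weights
            ws = [
                (tricube((x[j] - x[i]) / radius) if radius > 0 else 1.0) * robust[j]
                for j in nbrs
            ]
            # 3.3: local weighted regression, evaluated at x[i]
            fitted[i] = weighted_line_fit(
                [x[j] for j in nbrs], [y[j] for j in nbrs], ws, x[i]
            )
        # step 4: turn residuals into weights for the next pass
        residuals = [yi - fi for yi, fi in zip(y, fitted)]
        scale = 6.0 * statistics.median(abs(r) for r in residuals)
        robust = [bisquare(r, scale) for r in residuals]
    return x, fitted
```

<p>The brute-force neighbor search makes this even slower than it needs
to be, but it keeps the structure of the recipe visible: every point
gets its own regression on every pass.</p>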
<p>Clearly, this is an expensive procedure. For 10,000 points and 3
robustifying iterations (which is the default in R and statsmodels),
you’re calculating weights and running regressions 40,000 times (1
initial fit + 3 robustifying iterations). Running R’s <code>lm.fit</code> (which
is the lean, fast engine under <code>lm</code>) 40,000 times costs about 11
seconds. Add on all the costs from weight calculations (which will
happen 40,000 × <em>k</em> times, since a weight needs to be calculated for
each point’s neighbors) and it’s not surprising that the statsmodels
version is as slow as it is. It is an inherently expensive algorithm.</p>
<h2>Cheating our way to a faster lowess</h2>
<p>The question is, why is R’s lowess so fast? The answer is that R, like
most other implementations going back to Cleveland’s <a href="http://www.netlib.org/go/lowess.f">lowess.f</a> Fortran
program, doesn’t perform lowess calculations on all that data.</p>
<p>If you look at the <a href="http://stat.ethz.ch/R-manual/R-patched/library/stats/html/lowess.html">R help file for lowess</a>, you’ll see that in
addition to the parameters we’d expect—the data <code>x</code> and <code>y</code>; a
parameter to determine the size of the neighborhood; and the number of
robustifying iterations—there’s an argument called <code>delta</code>.</p>
<p>The idea behind <code>delta</code> is the following: <em>x<sub>i</sub></em> that are close together
aren’t very interesting. If we’ve already calculated <em>y<sup>*</sup><sub>i</sub></em> from the
neighborhood of data around <em>x<sub>i</sub></em>, and |<em>x<sub>i+1</sub></em> - <em>x<sub>i</sub></em>| < <code>delta</code>,
then we don’t really need to calculate <em>y<sup>*</sup><sub>i+1</sub></em>. It’s bound to be near <em>y<sup>*</sup><sub>i</sub></em>.</p>
<p>Instead let’s go out to an <em>x<sub>j</sub></em> that’s farther away from <em>x<sub>i</sub></em>, say
the farthest one still within <code>delta</code> distance. Let’s fit another
weighted regression here. All those points in between—within that delta
distance—can be approximated by a line going between the two regression
fits we made. Then, just keep skipping along in these delta-sized
steps—back-filling the predictions by linear interpolation as we
go—until the end of the data.</p>
<p>How much work have we saved ourselves? Assume as above 10,000 points and
4 passes (1 initial fit + 3 robustifying iterations). If the <em>x</em>’s are uniformly distributed along the axis, and
we take <code>delta</code> to be <code>0.01 * (max(x) - min(x))</code> (which is the default
value in R), then we’re only running 100 regressions per iteration, or
400 overall. Compared to the 40,000 that statsmodels is running, we can
see why R is much faster. It’s cheating!</p>
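<p>The skipping and back-filling can be sketched as two small helpers.
This is an illustration under assumptions, not R’s or Cleveland’s code:
<code>anchor_indices</code> and <code>fill_by_interpolation</code> are hypothetical names,
<code>x</code> is assumed to be sorted already, and the fits at the anchor points
are taken as given (in practice they’d come from the local weighted regressions).</p>

```python
def anchor_indices(x, delta):
    """Indices of sorted x at which a local regression is actually run."""
    n = len(x)
    if n == 0:
        return []
    anchors = [0]
    i = 0
    while i != n - 1:
        j = i + 1  # always move at least one point forward
        # ...then keep going to the farthest point still within delta of x[i]
        while j + 1 < n and x[j + 1] - x[i] <= delta:
            j += 1
        anchors.append(j)
        i = j
    return anchors


def fill_by_interpolation(x, anchors, anchor_fits):
    """Back-fill skipped points on straight lines between neighboring anchors."""
    fitted = [0.0] * len(x)
    pairs = list(zip(anchors, anchor_fits))
    for a, fa in pairs:
        fitted[a] = fa
    for (a, fa), (b, fb) in zip(pairs, pairs[1:]):
        span = x[b] - x[a]
        for m in range(a + 1, b):
            t = (x[m] - x[a]) / span if span > 0 else 0.0
            fitted[m] = fa + t * (fb - fa)
    return fitted


# 10,000 evenly spaced points, delta = 1% of the x range
x = [i / 9999 for i in range(10000)]
anchors = anchor_indices(x, delta=0.01 * (x[-1] - x[0]))
print(len(anchors))  # on the order of 100 regressions per pass, not 10,000
```

<p>With <code>delta = 0</code> every point becomes an anchor and nothing is
interpolated, which is the naive algorithm again.</p>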
<p>This kind of approximating is fine, really. It’s just assuming that, if
our model is <em>y = f(x) + e</em> and <em>f(x)</em> is what we’re trying to estimate
with lowess, we can take the linear approximation of it in small neighborhoods.</p>
<h2>Implementing a faster lowess in Python</h2>
<p>Algorithms for lowess written in low level languages aren’t hard to
find. In addition to Cleveland’s <a href="http://www.netlib.org/go/lowess.f">Fortran implementation</a>,
there’s also a <a href="http://svn.r-project.org/R/trunk/src/library/stats/src/lowess.c">C version</a> used by R (which is basically a direct
translation of Cleveland’s, but without all the pesky commenting to let
you know what it’s doing).</p>
<p>The <a href="https://github.com/statsmodels/statsmodels/blob/master/statsmodels/nonparametric/smoothers_lowess.py">statsmodels version</a>, though, is nicely organized: broken into
sub-functions with clear names, and exploiting vectorized operations.
But its slowness is not only because it doesn’t exploit the <code>delta</code> trick.
It also runs some expensive operations, like a call to SciPy’s <code>lstsq</code>
function in each tight loop.</p>
<p>So, in addition to adding the delta trick, we’d like to speed up those
calculations in the tight loop (part 3 in the list above) as much as
possible. Luckily, Cython lets us split the difference.</p>
<p>My Cython version of lowess is in my github repo, <a href="https://github.com/carljv/Will_it_Python/tree/master/MLFH/CH2/lowess%20work">here</a>, in the file
cylowess.py. There’s also an IPython notebook demonstrating it in
action, and files comprising a testing suite, comparing its output to R’s.</p>
<p>Let’s take a look at some real squiggly data to see how it works. The
Silverman motorcycle collision data, which is available as <code>mcycle</code> in
R’s <code>MASS</code> package, is great test data for non-parametric curve fitting
procedures. In addition to not having any simple parametric shape, it’s
got some edge case issues that can cause problems, like repeated x-values.</p>
<p>This plot compares my lowess implementation with statsmodels’ and R’s:</p>
<p><a href="../images/motorcycle-lowess-comparisons.png">
<img src="../images/motorcycle-lowess-comparisons.png" width=350px />
</a></p>
<p>The aggregate difference between R’s lowess and mine?</p>
<div class="highlight"><pre><span class="k">print</span><span class="p">(</span><span class="s">'R and New Lowess MAD: </span><span class="si">%5.2e</span><span class="s">'</span> <span class="o">%</span>
      <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">abs</span><span class="p">(</span><span class="n">r_lowess</span><span class="p">[</span><span class="s">'y'</span><span class="p">]</span> <span class="o">-</span> <span class="n">new_lowess</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">])))</span>
<span class="n">R</span> <span class="ow">and</span> <span class="n">New</span> <span class="n">Lowess</span> <span class="n">MAD</span><span class="p">:</span> <span class="mf">1.62e-13</span>
</pre></div>
<p>So it looks like it works.</p>
<p>Now let’s look at some timings. I’ll create some test data: 10,000
points, where <code>x</code> is uniformly distributed on [0, 20], and
<code>y = sin(x) + N(0, 0.5)</code>.</p>
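<p>Test data of this shape can be generated with nothing but the standard
library. This is a stand-in sketch, not the code from the notebook: the
function name and seed are made up, and it assumes the 0.5 in
<code>N(0, 0.5)</code> is a standard deviation rather than a variance.</p>

```python
import math
import random


def make_sine_data(n=10000, noise_sd=0.5, seed=0):
    """Return sorted x ~ Uniform[0, 20] and y = sin(x) + Gaussian noise."""
    rng = random.Random(seed)  # seeded so the data are reproducible
    x = sorted(rng.uniform(0.0, 20.0) for _ in range(n))
    y = [math.sin(xi) + rng.gauss(0.0, noise_sd) for xi in x]
    return x, y


x, y = make_sine_data()
```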
<p>Statsmodels’ lowess:</p>
<div class="highlight"><pre><span class="o">%</span><span class="n">timeit</span> <span class="o">-</span><span class="n">n</span> <span class="mi">3</span> <span class="n">smlw</span><span class="o">.</span><span class="n">lowess</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span>
<span class="mi">3</span> <span class="n">loops</span><span class="p">,</span> <span class="n">best</span> <span class="n">of</span> <span class="mi">3</span><span class="p">:</span> <span class="mf">22.8</span> <span class="n">s</span> <span class="n">per</span> <span class="n">loop</span>
</pre></div>
<p>The new Cythonized lowess:</p>
<div class="highlight"><pre><span class="o">%</span><span class="n">timeit</span> <span class="o">-</span><span class="n">n</span> <span class="mi">3</span> <span class="n">cyl</span><span class="o">.</span><span class="n">lowess</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span>
<span class="mi">3</span> <span class="n">loops</span><span class="p">,</span> <span class="n">best</span> <span class="n">of</span> <span class="mi">3</span><span class="p">:</span> <span class="mf">10.8</span> <span class="n">s</span> <span class="n">per</span> <span class="n">loop</span>
</pre></div>
<p>This is without the <code>delta</code> trick. Skimming the fat off of those
tight-looped operations and Cythonizing them cut the run time in half.
11 seconds still sucks, though, so let’s see what <code>delta</code> gets us.</p>
<div class="highlight"><pre><span class="n">delta</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">max</span><span class="p">()</span> <span class="o">-</span> <span class="n">x</span><span class="o">.</span><span class="n">min</span><span class="p">())</span> <span class="o">*</span> <span class="mf">0.01</span>
<span class="o">%</span><span class="n">timeit</span> <span class="o">-</span><span class="n">n</span> <span class="mi">3</span> <span class="n">cyl</span><span class="o">.</span><span class="n">lowess</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">delta</span> <span class="o">=</span> <span class="n">delta</span><span class="p">)</span>
<span class="mi">3</span> <span class="n">loops</span><span class="p">,</span> <span class="n">best</span> <span class="n">of</span> <span class="mi">3</span><span class="p">:</span> <span class="mi">125</span> <span class="n">ms</span> <span class="n">per</span> <span class="n">loop</span>
</pre></div>
<p>Much better. That’s the kind of time skipping 36,000 weighted
least-squares calculations will save you. Given that this is some curvy
data, is all this linear interpolation acceptable? I’ll re-run both with
a better level of the <code>frac</code> parameter; the default is 2/3, but I’ll
reduce it to 1/10 to use smaller neighborhoods in the regression and
allow for more curvature. Here’s the plot:</p>
<div class="highlight"><pre>sm_lowess = smlw.lowess(y, x, frac = 0.1)
new_lowess = cyl.lowess(y, x, frac = 0.1, delta = delta)
</pre></div>
<p><a href="../images/sine-10k-pts-lowess-compare.png">
<img src="../images/sine-10k-pts-lowess-compare.png" width=400px />
</a></p>
<p>Which looks just as good as the non-interpolated version, but doesn’t
leave you twiddling your thumbs.</p>
<h2>Conclusion</h2>
<p>After all this, we have a version of lowess that’s competitive with R’s
<code>lowess</code> function. R also has a much richer <code>loess</code> function, for which
there’s no real statsmodels equivalent. <code>loess</code> is a full-blown class
from which one can make predictions and compute confidence intervals,
among other things. It also allows for fitting a higher-dimensional
surface, not just a curve. But I have a day job, so that’s all for some
other time. This kind of simple lowess is typically enough for most needs.</p>
<p>With this obsessive-compulsive diversion into the guts of lowess out of
the way, I’ll wrap up Chapter 2 of <span class="caps">MLFH</span> in my next post.</p>Machine Learning for Hackers Chapter 2, Part 1: Summary stats and density estimators2012-05-01T04:00:00-04:00Carltag:slendermeans.org,2012-05-01:ml4h-ch2-p1.html<p>Chapter 2 of <span class="caps">MLFH</span> summarizes techniques for exploring your data:
determining data types, computing quantiles and other summary
statistics, and plotting simple exploratory graphics. I’m not going to
replicate it in its entirety; I’m just going to hit some of the more
involved or interesting parts. The IPython notebook I created for this
chapter, which lives <a href="https://github.com/carljv/Will_it_Python/tree/master/MLFH/CH2">here</a>, contains more code than I’ll present on
the blog.</p>
<p>This part’s highlights:</p>
<ol>
<li>Pandas objects, as we’ve seen before, have methods that provide
simple summary statistics.</li>
<li>The plotting methods in Pandas let you pass parameters to the
Matplotlib functions they call. I’ll use this feature to mess around
with histogram bins.</li>
<li>The <code>gaussian_kde</code> (kernel density estimator) function in
<code>scipy.stats.kde</code> provides density estimates similar to R’s
<code>density</code> function for Gaussian kernels. The <code>kdensity</code> function, in
<code>statsmodels.nonparametric.kde</code> provides that and other kernels, but
given the state of <code>statsmodels</code>’ documentation, you would probably
only find this function by accident. It’s also substantially slower
than <code>gaussian_kde</code> on large data. <em>[Not quite so! See
update at the end.]</em></li>
</ol>
<h2>Height and weight data</h2>
<p>The data analyzed in this chapter are the sexes, heights, and weights of
10,000 people. The raw file is a <span class="caps">CSV</span> that I import using <code>read_table</code> in Pandas:</p>
<div class="highlight"><pre><span class="n">heights_weights</span> <span class="o">=</span> <span class="n">read_table</span><span class="p">(</span><span class="s">'data/01_heights_weights_genders.csv'</span><span class="p">,</span> <span class="n">sep</span> <span class="o">=</span> <span class="s">','</span><span class="p">,</span> <span class="n">header</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>
</pre></div>
<p>Inspecting the data with <code>head</code>,</p>
<div class="highlight"><pre><span class="k">print</span> <span class="n">heights_weights</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</pre></div>
<p>gives us:</p>
<div class="highlight"><pre> Gender Height Weight
0 Male 73.847017 241.893563
1 Male 68.781904 162.310473
2 Male 74.110105 212.740856
3 Male 71.730978 220.042470
4 Male 69.881796 206.349801
5 Male 67.253016 152.212156
6 Male 68.785081 183.927889
7 Male 68.348516 167.971110
8 Male 67.018950 175.929440
9 Male 63.456494 156.399676
</pre></div>
<p>So it looks like heights are in inches, and weights are in pounds. It
also looks like the dataset is evenly split between men and women, since</p>
<div class="highlight"><pre>heights_weights.groupby('Gender')['Gender'].count()
</pre></div>
<p>results in:</p>
<div class="highlight"><pre>Gender
Female 5000
Male 5000
</pre></div>
<p>The data are simple, clean, and appear to have imported correctly. So,
we can start looking at some simple summaries.</p>
<h2>Numeric summaries, especially quantiles</h2>
<p>The first part of Chapter 2 covers the basic summary statistics: means,
medians, variances, and quantiles. The authors hand-roll the mean,
median, and variance functions to see how each is calculated. All of
these are available as methods of Pandas Series, or as NumPy
functions (which are typically what’s called by the equivalent Pandas methods).</p>
<p>The <code>describe</code> method of Pandas series and data frames, which we saw in
<a href="../ml4h-ch1-p3.html">Part 3 of Chapter 1</a>, gives summary statistics. The summary stats for
the height variable are:</p>
<div class="highlight"><pre><span class="n">heights</span> <span class="o">=</span> <span class="n">heights_weights</span><span class="p">[</span><span class="s">'Height'</span><span class="p">]</span>
<span class="n">heights</span><span class="o">.</span><span class="n">describe</span><span class="p">()</span>
<span class="n">count</span> <span class="mf">10000.000000</span>
<span class="n">mean</span> <span class="mf">66.367560</span>
<span class="n">std</span> <span class="mf">3.847528</span>
<span class="nb">min</span> <span class="mf">54.263133</span>
<span class="mi">25</span><span class="o">%</span> <span class="mf">63.505620</span>
<span class="mi">50</span><span class="o">%</span> <span class="mf">66.318070</span>
<span class="mi">75</span><span class="o">%</span> <span class="mf">69.174262</span>
<span class="nb">max</span> <span class="mf">78.998742</span>
</pre></div>
<p>The heights all lie within a reasonable range, with no apparent outliers
from bad data. The default quantile range in <code>describe</code> is 50%, so we
get the 75th and 25th percentiles. This can be changed with the
<code>percentile_width</code> argument; for example, <code>percentile_width = 90</code> would
give the 95th and 5th percentiles.</p>
<p>There doesn’t seem to be a direct analog to R’s <code>range</code> function, which
calculates the difference between the maximum and minimum value of a
vector, nor to its <code>quantile</code> function, which can calculate the quantiles at any
given series of probabilities. These are easy enough to replicate though.</p>
<blockquote>
<p><strong>Note: </strong>Nathaniel Smith, in comments, points out that R’s <code>range</code>
function doesn’t do this either, but just returns the min and max of a
vector. There <em>is</em> a function for this in NumPy, though: the
<code>my_range</code> function below gives the same result as would
<code>np.ptp(heights.values)</code>. <code>ptp</code> is the “peak-to-peak” (min-to-max) function.</p>
</blockquote>
<p>Range is trivial:</p>
<div class="highlight"><pre><span class="k">def</span> <span class="nf">my_range</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
<span class="sd">'''</span>
<span class="sd">Difference between the max and min of an array or Series</span>
<span class="sd">'''</span>
<span class="k">return</span> <span class="n">s</span><span class="o">.</span><span class="n">max</span><span class="p">()</span> <span class="o">-</span> <span class="n">s</span><span class="o">.</span><span class="n">min</span><span class="p">()</span>
</pre></div>
<p>Calling this, we get a range of 79.00 − 54.26 = 24.74 inches.</p>
<p>Next, a <code>quantiles</code> function to mimic R’s. We can just make a wrapper
around the <code>quantile</code> method, mapping it along a sequence of provided probabilities.</p>
<div class="highlight"><pre><span class="k">def</span> <span class="nf">my_quantiles</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">prob</span> <span class="o">=</span> <span class="p">(</span><span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.25</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">,</span> <span class="mf">0.75</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">)):</span>
<span class="sd">'''</span>
<span class="sd">Calculate quantiles of a series.</span>
<span class="sd">Parameters:</span>
<span class="sd">-----------</span>
<span class="sd">s : a pandas Series</span>
<span class="sd">prob : a tuple (or other iterable) of probabilities at</span>
<span class="sd">which to compute quantiles. Must be an iterable,</span>
<span class="sd">even for a single probability (e.g. prob = (0.50,)</span>
<span class="sd">not prob = 0.50).</span>
<span class="sd">Returns:</span>
<span class="sd">--------</span>
<span class="sd">A pandas series with the probabilities as an index.</span>
<span class="sd">'''</span>
<span class="n">q</span> <span class="o">=</span> <span class="p">[</span><span class="n">s</span><span class="o">.</span><span class="n">quantile</span><span class="p">(</span><span class="n">p</span><span class="p">)</span> <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">prob</span><span class="p">]</span>
<span class="k">return</span> <span class="n">Series</span><span class="p">(</span><span class="n">q</span><span class="p">,</span> <span class="n">index</span> <span class="o">=</span> <span class="n">prob</span><span class="p">)</span>
</pre></div>
<p>Note that the default argument gives quartiles. We can get deciles by calling:</p>
<div class="highlight"><pre><span class="k">print</span> <span class="n">my_quantiles</span><span class="p">(</span><span class="n">heights</span><span class="p">,</span> <span class="n">prob</span> <span class="o">=</span> <span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mf">1.1</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">))</span>
</pre></div>
<p>which spits out:</p>
<div class="highlight"><pre>0.0 54.263133
0.1 61.412701
0.2 62.859007
0.3 64.072407
0.4 65.194221
0.5 66.318070
0.6 67.435374
0.7 68.558072
0.8 69.811620
0.9 71.472149
1.0 78.998742
</pre></div>
<blockquote>
<p><strong>Note</strong>: the <code>my_quantiles</code> function I’ve written is a little awkward
when dealing with a single quantile. Because the list comprehension
that computes the quantiles requires that the <code>prob</code> argument be an
iterable, you would have to pass a list, tuple, array or other
iterable with a single value. You can’t just pass it a float. I’ve hit
this issue a few times writing Python functions–where it’s difficult
to make code robust to both iterable and singleton arguments. If
anyone has tips on this (should I really be doing type checking?), I’d
be thrilled to hear them.</p>
</blockquote>
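<p>One possible answer to the question in the note, offered as a sketch rather than the definitive approach: NumPy’s <code>np.atleast_1d</code> turns a bare float into a one-element array and leaves tuples, lists and arrays as one-dimensional arrays, so no explicit type checking is needed. The name <code>my_quantiles_flex</code> and the demo values are hypothetical, and the imports assume NumPy as <code>np</code> and pandas as <code>pd</code>.</p>

```python
import numpy as np
import pandas as pd

def my_quantiles_flex(s, prob=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Quantiles of a Series; prob may be a single float or a sequence."""
    # np.atleast_1d wraps a scalar in a 1-element array and converts
    # tuples/lists/arrays to a 1-d array, so the loop below always works.
    probs = np.atleast_1d(prob)
    q = [s.quantile(p) for p in probs]
    return pd.Series(q, index=probs)

demo = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])      # made-up data
single = my_quantiles_flex(demo, 0.5)             # a bare float now works
several = my_quantiles_flex(demo, (0.0, 0.5, 1.0))
```

<p>Both calls return a Series indexed by probability. One caveat: a generator would not be unpacked by <code>np.atleast_1d</code>, so this covers scalars and sequences rather than every iterable.</p>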
<h2>Histograms</h2>
<p>Next the authors mess around with histograms and density plots to
explore the distribution of the data. Noting that different bin sizes
for histograms can affect how we perceive the data’s distribution, they
plot histograms for a few different bin widths.</p>
<p>In Matplotlib, bins are not specified by their width, as is possible in
ggplot. We can either give Matplotlib the number of bins we want it to
plot, or specify the actual bin-edge locations. It’s not difficult to
translate a desired bin width into either one of these types of
argument. I’ll provide the sequence of bins.</p>
<p>First, 1-inch bins:</p>
<div class="highlight"><pre><span class="n">bins1</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="n">heights</span><span class="o">.</span><span class="n">min</span><span class="p">(),</span> <span class="n">heights</span><span class="o">.</span><span class="n">max</span><span class="p">(),</span> <span class="mf">1.0</span><span class="p">)</span>
<span class="n">heights</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">bins</span> <span class="o">=</span> <span class="n">bins1</span><span class="p">,</span> <span class="n">fc</span> <span class="o">=</span> <span class="s">'steelblue'</span><span class="p">)</span>
</pre></div>
<p><a href="../images/height_hist_bins1.png">
<img src= "../images/height_hist_bins1.png" width=450px />
</a></p>
<p>Note how I’m using the Pandas <code>hist</code> method, which, using a <code>**kwargs</code>
argument, can pass parameters to the Matplotlib plotting functions.
Next, 5-inch bins:</p>
<div class="highlight"><pre><span class="n">bins5</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="n">heights</span><span class="o">.</span><span class="n">min</span><span class="p">(),</span> <span class="n">heights</span><span class="o">.</span><span class="n">max</span><span class="p">(),</span> <span class="mf">5.</span><span class="p">)</span>
<span class="n">heights</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">bins</span> <span class="o">=</span> <span class="n">bins5</span><span class="p">,</span> <span class="n">fc</span> <span class="o">=</span> <span class="s">'steelblue'</span><span class="p">)</span>
</pre></div>
<p><a href="../images/height_hist_bins5.png">
<img src= "../images/height_hist_bins5.png" width=450px />
</a></p>
<p>And finally, 0.001-inch bins:</p>
<div class="highlight"><pre>bins001 = np.arange(heights.min(), heights.max(), .001)
heights.hist(bins = bins001, fc = 'steelblue')
plt.savefig('height_hist_bins001.png')
</pre></div>
<p><a href="../images/height_hist_bins001.png">
<img src= "../images/height_hist_bins001.png" width=450px />
</a></p>
<p>These all match the figures in the book, so I’m probably doing it right.</p>
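<p>One detail worth checking in bin sequences like the ones above: <code>np.arange</code> excludes its stop value, so the last edge can fall below the maximum, and values beyond the last edge are left out of the counts. The sketch below extends the stop by one bin width and counts with <code>np.histogram</code> instead of plotting. The helper name <code>width_to_edges</code> and the sample values are made up for illustration.</p>

```python
import numpy as np

def width_to_edges(values, width):
    """Bin edges of a given width that cover min(values)..max(values)."""
    lo, hi = np.min(values), np.max(values)
    # Extend the stop by one width so the last edge is >= the maximum
    return np.arange(lo, hi + width, width)

sample = np.array([54.3, 60.0, 61.2, 70.5, 78.9])  # made-up heights

edges = width_to_edges(sample, 5.0)
counts, _ = np.histogram(sample, bins=edges)

# For comparison: edges that stop short of the maximum drop the top value
short_edges = np.arange(sample.min(), sample.max(), 5.0)
short_counts, _ = np.histogram(sample, bins=short_edges)
```

<p>With the extended edges every observation is counted; with the shorter sequence the largest value falls outside the last bin. For 10,000 points the visual difference is tiny, which may be why the plots still match the book.</p>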
<h2>Kernel density estimators in SciPy and statsmodels</h2>
<p>R’s <code>density</code> function computes kernel density estimates. The default
kernel is Gaussian, but you can also use Epanechnikov, rectangular,
triangular, biweight, and cosine kernels.</p>
<p>In Python, it looks like you have two options for kernel density. The
first is <code>gaussian_kde</code> from the <code>scipy.stats.kde</code> module. This provides
a Gaussian kernel density estimate only. The other is <code>kdensity</code> in the
<code>statsmodels.nonparametric.kde</code> module, which provides alternative
kernels similar to R.</p>
<p>I actually wasn’t aware of the <code>kdensity</code> function for a while, until I
stumbled upon a mention of it on a mailing list archive. I couldn’t find
it in the statsmodels <a href="http://statsmodels.sourceforge.net/">documentation</a>. Statsmodels, generally, seems
to have a lot of undocumented functionality; not surprising for a young,
rapidly-expanding project.</p>
<p>Playing with both functions, I found some pros and cons for each.
Obviously <code>kdensity</code> provides an option of kernels, whereas
<code>gaussian_kde</code> does not. <code>kdensity</code> also generates simpler output than
<code>gaussian_kde</code>. <code>kdensity</code> provides a tuple of two arrays–the grid of
points at which the density was estimated, and the estimated density of
those points. <code>gaussian_kde</code> provides an object that you have to
evaluate on a set of points to get an array of estimated densities. So
essentially, you’re calling it twice, and I don’t see much point to that redundancy.</p>
<p>On the other hand, <code>kdensity</code> gets <em>much</em> slower than <code>gaussian_kde</code> as
the number of points increases. For the 10,000 points in the <code>heights</code>
array, <code>gaussian_kde</code> took about 3.3 seconds to output the array of
estimated densities. <code>kdensity</code> wasn’t finished after several minutes. I
haven’t looked carefully at the source code of the two functions, but I
assume <code>kdensity</code>’s problem is that at some point it creates a temporary
<code>NxN</code> array, which for <code>N = 10,000</code> is going to gum things up. Setting
the <code>gridsize</code> argument in <code>kdensity</code> to something even as large as
<code>5000</code> cuts the size of the temporary array in half, and reduces the
running time to about 3 seconds.</p>
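<p>To illustrate the kind of temporary array guessed at above, here is a naive Gaussian kernel density estimate in plain NumPy. It is a sketch of the textbook estimator, not the code of either library; the function name, the Scott’s-rule bandwidth and the simulated data are all assumptions made for the example. The intermediate array has one row per grid point and one column per data point, so its size grows as grid size × N.</p>

```python
import numpy as np

def naive_gaussian_kde(data, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    # Pairwise scaled distances: shape (len(grid), len(data))
    z = (grid[:, None] - data[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    # Average the kernels over the data points
    return kernel.sum(axis=1) / (len(data) * bandwidth)

rng = np.random.RandomState(0)
data = rng.normal(loc=66.0, scale=4.0, size=500)  # simulated, not the real heights

# Scott's rule of thumb for the bandwidth (an assumption for this sketch)
h = data.std(ddof=1) * len(data) ** (-1.0 / 5.0)

grid = np.linspace(data.min() - 4 * h, data.max() + 4 * h, 400)
density = naive_gaussian_kde(data, grid, h)

# A density should integrate to roughly 1 over the grid
area = density.sum() * (grid[1] - grid[0])
```

<p>Here the temporary array is 400 × 500. Evaluating at every one of 10,000 data points instead of a coarser grid would make it 10,000 × 10,000, which is consistent with the slowdown described above.</p>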
<p>This is probably worth exploring in a future post. In the meantime, I’m
going to stick with <code>gaussian_kde</code> and plot some densities.
[<strong>Note:</strong> See the update below. I’ve
updated the <a href="https://github.com/carljv/Will_it_Python/tree/master/MLFH/CH2" title="Chapter 2 github repo">IPython notebook</a> for this chapter to use Statsmodels’
<span class="caps">KDE</span> class instead of SciPy.]</p>
<p>First, heights:</p>
<div class="highlight"><pre><span class="n">density</span> <span class="o">=</span> <span class="n">kde</span><span class="o">.</span><span class="n">gaussian_kde</span><span class="p">(</span><span class="n">heights</span><span class="o">.</span><span class="n">values</span><span class="p">)</span>
<span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">()</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="n">heights</span><span class="o">.</span><span class="n">values</span><span class="p">),</span>
<span class="n">density</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="n">heights</span><span class="o">.</span><span class="n">values</span><span class="p">)))</span>
</pre></div>
<p><a href="../images/heights_density.png">
<img src= "../images/heights_density.png" width=450px />
</a></p>
<p>The sorting of the <code>heights</code> array is to make the lines connect nicely.
Otherwise, the lines will connect from point-to-point in the order they
occur in the array; we want the density curve to connect points left-to-right.</p>
<p>Notice the slight bi-modality in the figure. What we’re likely seeing is
a mixture of male and female distributions. We can plot those separately.</p>
<div class="highlight"><pre><span class="c"># Pull out male and female heights as arrays over which to compute densities</span>
<span class="n">heights_m</span> <span class="o">=</span> <span class="n">heights</span><span class="p">[</span><span class="n">heights_weights</span><span class="p">[</span><span class="s">'Gender'</span><span class="p">]</span> <span class="o">==</span> <span class="s">'Male'</span><span class="p">]</span><span class="o">.</span><span class="n">values</span>
<span class="n">heights_f</span> <span class="o">=</span> <span class="n">heights</span><span class="p">[</span><span class="n">heights_weights</span><span class="p">[</span><span class="s">'Gender'</span><span class="p">]</span> <span class="o">==</span> <span class="s">'Female'</span><span class="p">]</span><span class="o">.</span><span class="n">values</span>
<span class="n">density_m</span> <span class="o">=</span> <span class="n">kde</span><span class="o">.</span><span class="n">gaussian_kde</span><span class="p">(</span><span class="n">heights_m</span><span class="p">)</span>
<span class="n">density_f</span> <span class="o">=</span> <span class="n">kde</span><span class="o">.</span><span class="n">gaussian_kde</span><span class="p">(</span><span class="n">heights_f</span><span class="p">)</span>
<span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">()</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="n">heights_m</span><span class="p">),</span> <span class="n">density_m</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="n">heights_m</span><span class="p">)),</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'Male'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="n">heights_f</span><span class="p">),</span> <span class="n">density_f</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="n">heights_f</span><span class="p">)),</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'Female'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">()</span>
</pre></div>
<p><a href="../images/height_density_bysex.png">
<img src= "../images/height_density_bysex.png" width=450px />
</a></p>
<p>We also have a weight variable we can plot.</p>
<div class="highlight"><pre><span class="n">weights_m</span> <span class="o">=</span> <span class="n">heights_weights</span><span class="p">[</span><span class="n">heights_weights</span><span class="p">[</span><span class="s">'Gender'</span><span class="p">]</span> <span class="o">==</span> <span class="s">'Male'</span><span class="p">][</span><span class="s">'Weight'</span><span class="p">]</span><span class="o">.</span><span class="n">values</span>
<span class="n">weights_f</span> <span class="o">=</span> <span class="n">heights_weights</span><span class="p">[</span><span class="n">heights_weights</span><span class="p">[</span><span class="s">'Gender'</span><span class="p">]</span> <span class="o">==</span> <span class="s">'Female'</span><span class="p">][</span><span class="s">'Weight'</span><span class="p">]</span><span class="o">.</span><span class="n">values</span>
<span class="n">density_m</span> <span class="o">=</span> <span class="n">kde</span><span class="o">.</span><span class="n">gaussian_kde</span><span class="p">(</span><span class="n">weights_m</span><span class="p">)</span>
<span class="n">density_f</span> <span class="o">=</span> <span class="n">kde</span><span class="o">.</span><span class="n">gaussian_kde</span><span class="p">(</span><span class="n">weights_f</span><span class="p">)</span>
<span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">()</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="n">weights_m</span><span class="p">),</span> <span class="n">density_m</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="n">weights_m</span><span class="p">)),</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'Male'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="n">weights_f</span><span class="p">),</span> <span class="n">density_f</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="n">weights_f</span><span class="p">)),</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'Female'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">()</span>
</pre></div>
<p><a href="../images/weight_density_bysex.png">
<img src= "../images/weight_density_bysex.png" width=450px />
</a></p>
<p>To finish up, let’s move each density plot to its own subplot, to match
Figure 2-11 on page 51.</p>
<div class="highlight"><pre>fig, axes = plt.subplots(nrows = 2, ncols = 1, sharex = True, figsize = (9, 6))
plt.subplots_adjust(hspace = 0.1)
axes[0].plot(np.sort(weights_f), density_f(np.sort(weights_f)),
label = 'Female')
axes[0].xaxis.tick_top()
axes[0].legend()
axes[1].plot(np.sort(weights_m), density_m(np.sort(weights_m)),
label = 'Male')
axes[1].legend()
</pre></div>
<p><a href="../images/weight_density_bysex_subplot.png">
<img src= "../images/weight_density_bysex_subplot.png" width=450px />
</a></p>
<p>Here I’m using the <code>subplots</code> function, same as in <a href="../ml4h-ch1-p5.html">Part 5 of Chapter
1</a>, and sharing the x-axis to make clear the difference between the
distributions’ central tendencies.</p>
<h2>Conclusion</h2>
<p>I’ll wrap up Chapter 2 in the next post, where I’ll look at lowess
smoothing in Statsmodels, and get a little taste of logistic regression.</p>
<h2>Update!</h2>
<p>Statsmodels honcho Skipper Seabold sets me straight in the comments.
While the <code>kdensity</code> function is slow, statsmodels has an implementation
which uses Fast Fourier Transforms for Gaussian kernels and is
substantially faster than SciPy’s <code>gaussian_kde</code>.</p>
<p>For the heights array:</p>
<div class="highlight"><pre><span class="c"># Create a KDE object</span>
<span class="n">heights_kde</span> <span class="o">=</span> <span class="n">sm</span><span class="o">.</span><span class="n">nonparametric</span><span class="o">.</span><span class="n">kde</span><span class="o">.</span><span class="n">KDE</span><span class="p">(</span><span class="n">heights</span><span class="o">.</span><span class="n">values</span><span class="p">)</span>
<span class="c"># Estimate the density by fitting the object (default Gaussian kernel via FFT)</span>
<span class="n">heights_kde</span><span class="o">.</span><span class="n">fit</span><span class="p">()</span>
</pre></div>
<p>We can then plot this vector of estimated densities,
<code>heights_kde.density</code>, against the points in <code>heights_kde.support</code>.</p>
<p>I’ve updated the <a href="https://github.com/carljv/Will_it_Python/tree/master/MLFH/CH2" title="Chapter 2 github repo">IPython notebook</a> for this chapter to use
Statsmodels’ <span class="caps">KDE</span> throughout, so check it out for more detail.</p>
Machine Learning for Hackers Chapter 1, Part 5: Trellis graphs. 2012-04-27T04:00:00-04:00 Carl tag:slendermeans.org,2012-04-27:ml4h-ch1-p5.html
<h2>Introduction</h2>
<p>This post will wrap up Chapter 1 of <span class="caps">MLFH</span>. The only task left is to
replicate the authors’ trellis graph on p. 26. The plot is made up of 50
panels, one for each <span class="caps">U.S.</span> state, with each panel plotting the number of
<span class="caps">UFO</span> sightings by month in that state.</p>
<p>The key takeaways from this part are, unfortunately, a bunch of gripes
about Matplotlib. Since I can’t transmit, blogospherically, the migraine
I got over the two afternoons I spent wrestling with this graph, let me
just try to succinctly list my grievances.</p>
<ol>
<li>Out-of-the-box, Matplotlib graphs are uglier than those produced by
either lattice or ggplot in R: The default color cycle is made up of
dark primary colors. Tick marks and labels are poorly placed in
anything but the simplest graphs. Non-data graph elements, like
bounding boxes and gridlines, are too prominent and take focus away
from the data elements.</li>
<li>The <span class="caps">API</span> is deeply confusing and difficult to remember. You have
various objects that live in various containers. To make adjustments
to graphs, you have to remember what container the thing you want to
adjust lives in, remember what the object and its property are
called, and then remember how Matplotlib’s <em>getting</em> and <em>setting</em>
procedures work.</li>
<li>The <code>pyplot</code> set of commands is supposed to provide convenience
functions, but these abstractions seem to leak early and often. Once
you need to make finer adjustments, you’re back to the underlying
<span class="caps">API</span> nightmare.</li>
<li>The documentation is both clear and comprehensive. But where it is
clear, it is not comprehensive, and where it is comprehensive, it is
not clear. For example, the <a href="http://matplotlib.sourceforge.net/users/artists.html">Artist tutorial</a> is a pretty clear
big picture of Matplotlib’s <span class="caps">API</span>. Once you need any detail, though,
you’re dealing with <a href="http://matplotlib.sourceforge.net/api/artist_api.html#module-matplotlib.lines">this</a>.</li>
<li>Creating trellis graphs requires way more manual work than in either
lattice or ggplot. The <code>subplot</code> functionality of Matplotlib is
highly flexible, but in most cases, the user is going to want the
code to do the thinking for them and not manually place every graph
(or do a bunch of bookkeeping with loops).</li>
</ol>
<p>With that off my chest, let me say that I have a ton of respect for
Matplotlib’s developers. It is a massively complex library, and clearly
very powerful and flexible. I have no doubt that Matplotlib gurus can do
amazing things. I’m just trying to convey the non-guru’s perspective.
Graphing libraries are difficult to design because they must be
incredibly flexible and allow users to manipulate all of the myriad
parts of the graph, but at the same time, they can’t overwhelm users
with detail when the flexibility isn’t needed. How anyone does
it–especially in an open-source project–I don’t know.</p>
<p>It’s also possible that I’m just <em>Doing it Wrong</em>, and in fact there are
easy ways to do all the things I’ve complained about. If that’s the
case, I hope someone reading this will enlighten me.</p>
<h2>Trellis graphs in R and Matplotlib</h2>
<p>In my opinion, trellis graphs are the “killer app” of multivariate data
visualization. I produce trellis line and scatter plots more than almost
any other kind of visualization. As such, it’s important for me to be
able to easily produce quality trellis graphs.</p>
<p>Trellis graphs are easy to create in R. The two most popular high-level
graphing packages in R, lattice and ggplot, both have simple methods for
creating them. Indeed, creating trellis graphs is lattice’s <em>raison
d’être</em>, and the functionality and interface design in the package
revolve around dealing with trellis graphs and the panels within. In
ggplot, the trellis is not such a central focus, but it still has
easy-to-use methods for making and modifying trellis graphs (which it
refers to as “faceted” graphs).</p>
<p>For example, the graph we want to make is a one liner in lattice:</p>
<div class="highlight"><pre>xyplot<span class="p">(</span>sightings <span class="o">~</span> year_month <span class="o">|</span> us_state<span class="p">,</span> data <span class="o">=</span> sightings_counts<span class="p">,</span>
type <span class="o">=</span> <span class="s">'l'</span><span class="p">,</span> layout <span class="o">=</span> <span class="kt">c</span><span class="p">(</span><span class="m">5</span><span class="p">,</span> <span class="m">10</span><span class="p">))</span>
</pre></div>
<p>Once you get the hang of R’s formula expressions–which doesn’t take
long–this is an easy, expressive way to create a trellis graph. The
authors use ggplot, which I find a bit less natural, but is still very easy.</p>
<p>Part of what makes trellis graphs so straightforward in R is that the
concept of factors, and their use as conditioning variables, is so
well-baked into the language. Matplotlib is essentially a plotting
utility for NumPy, so it’s designed to plot arrays, not rich data
structures. Without factors, without a notion of conditioning, and to a
lesser extent, without formulas, trellis graphs just don’t come naturally.</p>
<p>Pandas, though, has structures that, if a plotting library were designed
to understand them, might provide for easy trellis-ing. Even though
Pandas doesn’t have factors, I could see, for example, a <code>plot</code> method
for Pandas’ <code>groupby</code> objects that produces trellis graphs by default.</p>
<h2>Plotting the <span class="caps">UFO</span> trellis graph</h2>
<p>With all that throat-clearing out of the way, let’s get down to plotting
the graph. The authors plot 50 state panels, with a 10-by-5 layout.
Since I’ve included <span class="caps">D.C.</span> in my data, I have to plot 51 panels. You can
fit this in a 17-by-3 layout, but that’s pretty awkward. I’d like to
have 4 columns instead, but to fit 51 graphs, I’ll need 13 rows.
That’s 52 subplots, meaning the 13th row won’t have graphs in every
column, only the first three. I’m going to call these last three graphs
the <code>hangover</code> graphs, and I’m going to define their count as its own variable to
help inform the layout procedures I run later.</p>
<p>Here are the layout parameters, then:</p>
<div class="highlight"><pre><span class="n">nrow</span> <span class="o">=</span> <span class="mi">13</span><span class="p">;</span> <span class="n">ncol</span> <span class="o">=</span> <span class="mi">4</span><span class="p">;</span> <span class="n">hangover</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">us_states</span><span class="p">)</span> <span class="o">%</span> <span class="n">ncol</span>
</pre></div>
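<p>The row count and the hangover can also be derived from the number of panels instead of being typed in by hand. A small sketch, with the hypothetical helper name <code>trellis_layout</code>:</p>

```python
def trellis_layout(n_panels, ncol):
    """Rows needed for n_panels in ncol columns, and panels on a partial last row."""
    nrow = -(-n_panels // ncol)   # ceiling division without floats
    hangover = n_panels % ncol    # 0 means the last row is full
    return nrow, hangover

layout_51 = trellis_layout(51, 4)   # 50 states plus D.C., 4 columns
layout_50 = trellis_layout(50, 5)   # the book's 50-state, 5-column layout
```

<p>For 51 panels in 4 columns this gives 13 rows with 3 panels left over on the last row; for the book’s 50 panels in 5 columns it gives 10 full rows.</p>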
<p>Now let me get the “framing” objects in place: the figure, the subplot
layout, and the titles.</p>
<div class="highlight"><pre><span class="n">fig</span><span class="p">,</span> <span class="n">axes</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">nrow</span><span class="p">,</span> <span class="n">ncol</span><span class="p">,</span> <span class="n">sharey</span> <span class="o">=</span> <span class="bp">True</span><span class="p">,</span>
<span class="n">figsize</span> <span class="o">=</span> <span class="p">(</span><span class="mi">9</span><span class="p">,</span> <span class="mi">11</span><span class="p">))</span>
<span class="n">fig</span><span class="o">.</span><span class="n">suptitle</span><span class="p">(</span><span class="s">'Monthly UFO Sightings by U.S. State</span><span class="se">\n</span><span class="s">January 1990 through August 2010'</span><span class="p">,</span>
<span class="n">size</span> <span class="o">=</span> <span class="mi">12</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">subplots_adjust</span><span class="p">(</span><span class="n">wspace</span> <span class="o">=</span> <span class="o">.</span><span class="mo">05</span><span class="p">,</span> <span class="n">hspace</span> <span class="o">=</span> <span class="o">.</span><span class="mo">05</span><span class="p">)</span>
</pre></div>
<p>The <code>subplots</code> function is some recently implemented syntactic sugar
around Matplotlib’s <code>subplot</code> functionality (see the section on “Easy
Pythonic Subplots” <a href="http://matplotlib.sourceforge.net/users/whats_new.html#easy-pythonic-subplots">here</a>). The <code>sharey</code> argument tells Matplotlib
that the panels should all share the same y axis. Technically I want it
to share an x axis too, but Matplotlib kept throwing errors when I tried
to use the <code>sharex</code> argument with dates on the x-axis. Given the data,
the panels will end up sharing an x axis anyway, so this argument isn’t
necessary. The function returns two objects: <code>fig</code> refers to the overall
figure container, and <code>axes</code> is an array containing each of the
subplot/panel objects – so <code>axes[0, 0]</code> is the first panel.</p>
<p>Now the rest of the code:</p>
<div class="highlight"><pre><span class="n">num_state</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">nrow</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">ncol</span><span class="p">):</span>
        <span class="n">xs</span> <span class="o">=</span> <span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">]</span>
        <span class="n">xs</span><span class="o">.</span><span class="n">grid</span><span class="p">(</span><span class="n">linestyle</span> <span class="o">=</span> <span class="s">'-'</span><span class="p">,</span> <span class="n">linewidth</span> <span class="o">=</span> <span class="o">.</span><span class="mi">25</span><span class="p">,</span> <span class="n">color</span> <span class="o">=</span> <span class="s">'gray'</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">num_state</span> <span class="o"><</span> <span class="mi">51</span><span class="p">:</span>
            <span class="n">st</span> <span class="o">=</span> <span class="n">us_states</span><span class="p">[</span><span class="n">num_state</span><span class="p">]</span>
            <span class="n">sightings_counts</span><span class="o">.</span><span class="n">ix</span><span class="p">[</span><span class="n">st</span><span class="p">,</span> <span class="p">]</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">ax</span> <span class="o">=</span> <span class="n">xs</span><span class="p">,</span> <span class="n">linewidth</span> <span class="o">=</span> <span class="o">.</span><span class="mi">75</span><span class="p">)</span>
            <span class="n">xs</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">0.05</span><span class="p">,</span> <span class="o">.</span><span class="mi">95</span><span class="p">,</span> <span class="n">st</span><span class="o">.</span><span class="n">upper</span><span class="p">(),</span> <span class="n">transform</span> <span class="o">=</span> <span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">]</span><span class="o">.</span><span class="n">transAxes</span><span class="p">,</span>
                    <span class="n">verticalalignment</span> <span class="o">=</span> <span class="s">'top'</span><span class="p">)</span>
            <span class="n">num_state</span> <span class="o">+=</span> <span class="mi">1</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="c"># Make extra subplots invisible</span>
            <span class="n">plt</span><span class="o">.</span><span class="n">setp</span><span class="p">(</span><span class="n">xs</span><span class="p">,</span> <span class="n">visible</span> <span class="o">=</span> <span class="bp">False</span><span class="p">)</span>
        <span class="n">xtl</span> <span class="o">=</span> <span class="n">xs</span><span class="o">.</span><span class="n">get_xticklabels</span><span class="p">()</span>
        <span class="n">ytl</span> <span class="o">=</span> <span class="n">xs</span><span class="o">.</span><span class="n">get_yticklabels</span><span class="p">()</span>
        <span class="c"># X-axis tick labels:</span>
        <span class="c"># Turn off tick labels for all but the bottom-most</span>
        <span class="c"># subplots. The bottom-most plots are those on the last row</span>
        <span class="c"># and, if the last row doesn't have a subplot in every column,</span>
        <span class="c"># those on the next row up in the columns left empty.</span>
        <span class="c">#</span>
        <span class="c"># Y-axis tick labels:</span>
        <span class="c"># Put left-axis labels on the first column of subplots,</span>
        <span class="c"># odd rows. Put right-axis labels on the last column</span>
        <span class="c"># of subplots, even rows.</span>
        <span class="k">if</span> <span class="n">i</span> <span class="o"><</span> <span class="n">nrow</span> <span class="o">-</span> <span class="mi">2</span> <span class="ow">or</span> <span class="p">(</span><span class="n">i</span> <span class="o"><</span> <span class="n">nrow</span> <span class="o">-</span> <span class="mi">1</span> <span class="ow">and</span> <span class="p">(</span><span class="n">hangover</span> <span class="o">==</span> <span class="mi">0</span> <span class="ow">or</span> <span class="n">j</span> <span class="o"><=</span> <span class="n">hangover</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)):</span>
            <span class="n">plt</span><span class="o">.</span><span class="n">setp</span><span class="p">(</span><span class="n">xtl</span><span class="p">,</span> <span class="n">visible</span> <span class="o">=</span> <span class="bp">False</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">j</span> <span class="o">></span> <span class="mi">0</span> <span class="ow">or</span> <span class="n">i</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span>
            <span class="n">plt</span><span class="o">.</span><span class="n">setp</span><span class="p">(</span><span class="n">ytl</span><span class="p">,</span> <span class="n">visible</span> <span class="o">=</span> <span class="bp">False</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">j</span> <span class="o">==</span> <span class="n">ncol</span> <span class="o">-</span> <span class="mi">1</span> <span class="ow">and</span> <span class="n">i</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span>
            <span class="n">xs</span><span class="o">.</span><span class="n">yaxis</span><span class="o">.</span><span class="n">tick_right</span><span class="p">()</span>
        <span class="n">plt</span><span class="o">.</span><span class="n">setp</span><span class="p">(</span><span class="n">xtl</span><span class="p">,</span> <span class="n">rotation</span><span class="o">=</span><span class="mf">90.</span><span class="p">)</span>
</pre></div>
<p>Let’s walk through this:</p>
<p>First, set up a counter to keep track of what state we’re plotting. This
is a little un-Pythonic, but given what I do inside the loop, I couldn’t
think of a better way.</p>
<p>Now, for each row, column in the 13-by-4 array of panels (and this code
works for any row/column combination, as long as rows × columns >= 51):</p>
<ol>
<li>Assign the panel (“axis”) associated with this row, column pair to
its own variable.</li>
<li>Draw gray gridlines in the panel.</li>
<li>Go to the state in the <code>us_state</code> list corresponding to the current
value of the state counter.</li>
<li>Select this state out of the <code>sightings_counts</code> series and plot its
data in the current panel. Then, put a text label with the state’s
initials in the upper left corner.</li>
<li>If I’ve gone through all the states, and the state counter variable
is greater than 51, then make the panel invisible.</li>
<li>Assign the x- and y-axis <code>ticklabel</code> objects for the current panel
to variables. We’re going to manipulate their attributes.</li>
<li>
<p>Now some tricky stuff. I want to do the following things to the tick labels:</p>
<ul>
<li>I want to turn off the x-axis tick labels for all but the
bottom-most panels, taking into account the hangover.</li>
<li>I want to alternate the y-axis tick labels so that they are on
the left for odd-numbered rows, and on the right for
even-numbered rows. Having labels on both sides makes the graph
easier to read, but having them on the same side on every row
leads to overcrowding and overlapping.</li>
</ul>
</li>
<li>
<p>Finally, I want the x-axis tick labels rotated 90 degrees. This
gives space to put as many as possible on the graph without
overcrowding (here, we can label every two years).</p>
</li>
</ol>
<p>Here’s the result:</p>
<p><a href="../images/ufo_ts_bystate.png">
<img src="../images/ufo_ts_bystate.png" width=500px />
</a></p>
<p>Not bad, I think. And maybe even better than the out-of-the-box version
you get with ggplot. But it was a tremendous amount of work, and I don’t
know if I’m going to be able to decipher this code six months from now.
Most of that work is bookkeeping: keeping track of which panel I’m in and
where it sits in the layout. There ought to be a function that does this
for me.</p>
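<p>As a rough sketch of what such a function might look like, the tick-label decisions can be pulled out of the plotting loop into a small pure function. The name <code>tick_label_plan</code> and its return format are hypothetical, and I'm assuming <code>hangover</code> means the number of panels in a partly filled last row (0 if the row is full). It's written for Python 3 and only mirrors the conditions in the loop above; it doesn't touch Matplotlib.</p>

```python
def tick_label_plan(i, j, nrow, ncol, hangover):
    """Decide tick-label placement for the panel at row i, column j (0-indexed).

    hangover is assumed to be the number of panels in the last row when
    that row is only partly filled, or 0 when it is full.
    """
    # A panel keeps its x labels only if nothing is drawn beneath it:
    # the last row, plus the second-to-last row in the columns that a
    # partly filled last row does not reach.
    above_bottom = i < nrow - 2 or (
        i < nrow - 1 and (hangover == 0 or j <= hangover - 1)
    )
    # Alternate y labels: left edge on even row indices (odd-numbered rows),
    # right edge on odd row indices (even-numbered rows).
    if j == 0 and i % 2 == 0:
        y_side = "left"
    elif j == ncol - 1 and i % 2 == 1:
        y_side = "right"
    else:
        y_side = None
    return {"show_x": not above_bottom, "y_side": y_side}


# Plan for the 13-by-4 layout with 51 states (3 panels in the last row).
plan = [[tick_label_plan(i, j, 13, 4, 3) for j in range(4)] for i in range(13)]
```

<p>The plotting loop would then only need to look up <code>plan[i][j]</code> and apply it, which keeps the layout rules in one place. Panels that are hidden entirely (the empty slot in the last row) still get an entry here; the caller would skip them.</p>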
<h2>Conclusion</h2>
<p>So that’s it for Chapter 1 of <span class="caps">MLFH</span>. Overall, I was pleasantly surprised
by Pandas and how easy it made loading, cleaning, and manipulating data.
While there are a couple of things from R that I missed, there were
several other things I thought were easier and more flexible with Pandas.</p>
<p>On the other hand, going from lattice and ggplot to Matplotlib is like
taking a time machine back to the early ‘90s. After reading the
documentation and experimenting for several days, I still don’t think
I’m sure how it works. Hopefully I’ll get the hang of it as I go forward.</p>
<p>My take is the Python data analysis community is aware of its
“visualization gap” vis-a-vis R, and there are tools in the works to
solve this issue. I’ve heard whispers about “ggplot for Python” or “D3
for Python.” Everything is still in the early stages, and it will
probably be a while before better tools are available.</p>
<p>I’m also a little uncertain about the “x for Python” notion of creating
graphing libraries. Matplotlib’s <code>pyplot</code> is essentially a “Matlab for
Python” approach to graphics, and I don’t know that works to its credit.
I’d much rather have a solid, Pythonic graphing library that lets me
easily make publication-quality versions of the workhorse data graphics,
than have something that apes the latest faddish graphing tool. There
are a lot of smart people working on the problem, though, and I’m really
excited to see what happens.</p>Machine Learning for Hackers, Chapter 1, Part 4: Data aggregation and reshaping.2012-04-26T04:00:00-04:00Carltag:slendermeans.org,2012-04-26:ml4h-ch1-p4.html<h2>Introduction</h2>
<p>In the <a href="../ml4h-ch1-p3.html">last part</a> I made some simple summaries of the cleaned <span class="caps">UFO</span>
data: basic descriptive statistics and histograms. At the very end, I
did some simple data aggregation by summing up the sightings by date,
and plotted the resulting time series. In this part, I’ll go further
with the aggregation, totalling sightings by state and month.</p>
<p>The takeaway from this part is that Pandas dataframes have some
powerful methods for aggregating and manipulating data. I’ll show
<code>groupby</code>, <code>reindex</code>, hierarchical indices, and <code>stack</code> and <code>unstack</code> in action.</p>
<h2>The shape of data: the long and the wide of it</h2>
<p>The first step in aggregating and reshaping data is to figure out the
final form you want the data to be in. This form is basically defined by
<em>content</em> and <em>shape</em>.</p>
<p>We know what we want the content to be: an entry in the data should give
the number of sightings in a state/month combination.</p>
<p>We have two choices for the shape: wide or long. The wide version of
this data would have months as the rows and states as the columns; it
would be a 248 by 51 table with the number of sightings as entries. This
is a really natural way to shape the data if we were presenting a table,
for example.</p>
<p>One of the things I’ve picked up from my years of using R, though, is a
preference for long data. This is because R’s <code>factors</code> and <code>formulas</code>
with easy conditioning make it easier to work with long data. The most
common example is using <code>lattice</code> plots. To generate a lattice plot of
<code>y</code> over <code>x</code> with panels defined by a level of the variable <code>f</code>, you
just call <code>xyplot(y ~ x | f)</code>. For this to work though, the data must be
long, with <code>f</code> a column of factors, and the <code>x</code> column will likely be
some values repeated for each level of <code>f</code>. This seems kind of redundant
and unwieldy when you’re used to tables and spreadsheets, but it becomes
more natural when you start working with tools like <code>lattice</code> or
<code>ggplot</code>, using more panel data, or doing more <a href="http://vita.had.co.nz/papers/plyr.html"><em>split-apply-combine</em></a>
or <em>map-reduce</em> types of procedures.</p>
<p>Because Pandas dataframes are so organized around indices, and because
Pandas allows for hierarchical indexing, we’ll find that it will be a
good strategy to shape data in a way that provides for informative
indices. This will give us access to a host of powerful methods to
manipulate the dataframe. In this case, as we’ll see, by making the data
long, we’ll be able to push most of the information into the dataframe’s index.</p>
<p>The long version of our <span class="caps">UFO</span> data would have rows defined by a
state/month pair, and a column recording the number of sightings for
that pair. In R–as the authors do in the book–you’ll have a dataframe
with three columns. The first two are the <em>factor</em> variables <code>USState</code>
and <code>YearMonth</code>. (I’m not actually sure these are technically factor
variables in the authors’ implementation, but they are conceptually).
The third is the sightings count.</p>
<p>In Pandas, since the state and month pairs identify unique observations,
it’s natural to make these indices of the dataframe. Pandas supports
hierarchical indexing by using unique tuples–here a tuple would be
<em>(state, month)</em>.</p>
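<p>To make the long shape concrete, here is a small sketch with made-up counts (not the real UFO data). It assumes Python 3 and a recent Pandas; the variable names are mine. The long series is indexed by <em>(state, month)</em> tuples, and widening it is a single method call.</p>

```python
import datetime as dt

import pandas as pd

# Toy data, not the real UFO counts.
pairs = [
    ("ak", dt.date(1990, 1, 1)),
    ("ak", dt.date(1990, 3, 1)),
    ("al", dt.date(1990, 1, 1)),
]
index = pd.MultiIndex.from_tuples(pairs, names=["us_state", "year_month"])

# Long form: one row per (state, month) pair, one column of counts.
long_counts = pd.Series([1, 1, 2], index=index, name="sightings")

# Selecting on the outer level returns that state's series, indexed by month.
ak_counts = long_counts.loc["ak"]

# Wide form: months as rows, states as columns; missing pairs become NaN.
wide_counts = long_counts.unstack("us_state")
```

<p>Most of the information lives in the index, which is what makes selections like <code>long_counts.loc["ak"]</code> so short.</p>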
<h2>Aggregating the data</h2>
<p>Now that we’ve decided the form of the data, let’s implement all this.</p>
<p>The first step is to create a year-month variable. I do this just by
taking the date of each sighting, and calculating a new date with the
same year and month, but set to the first of the month. This is just
another <code>map</code> operation.</p>
<div class="highlight"><pre>ufo_us['year_month'] = ufo_us['date_occurred'].map(lambda x:
dt.date(x.year, x.month, 1))
</pre></div>
<blockquote>
<p><strong>Note</strong>: The authors approach this problem a little differently,
using R’s <code>strftime</code> function to turn the dates into a string of the form
<code>YYYY-MM</code>. I prefer to keep them numeric (it makes time series
plots more sensible), but either way works. My choice of the first
day of the month is arbitrary, and just serves to collect the dates into groups.</p>
</blockquote>
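<p>For comparison, here is a minimal sketch of both approaches on a few made-up timestamps: collapsing each date to the first of its month, and formatting it as a <code>YYYY-MM</code> string. It assumes Python 3 and a recent Pandas; <code>first_of_month</code> is just an illustrative helper name.</p>

```python
import datetime as dt

import pandas as pd


def first_of_month(d):
    """Collapse a date or datetime to the first day of its month."""
    return dt.date(d.year, d.month, 1)


# Made-up sighting times for illustration.
occurred = pd.Series([
    dt.datetime(1994, 2, 17, 22, 30),
    dt.datetime(1994, 2, 3),
    dt.datetime(1994, 11, 9),
])

# Date-valued grouping key, as in this post.
year_month = occurred.map(first_of_month)

# String-valued grouping key, as in the authors' approach.
labels = occurred.map(lambda d: d.strftime("%Y-%m"))
```

<p>Either key puts the two February sightings in the same group; the date version just stays sortable and plottable as a date.</p>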
<p>Then we want to sum up the sightings by state and month. To do this,
I’ll use Pandas’ <code>groupby</code> method. <code>groupby</code>, as you’d expect, works like
<span class="caps">SQL</span>’s <code>GROUP BY</code> statement.</p>
<div class="highlight"><pre><span class="n">sightings_counts</span> <span class="o">=</span> <span class="n">ufo_us</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s">'us_state'</span><span class="p">,</span>
<span class="s">'year_month'</span><span class="p">])[</span><span class="s">'year_month'</span><span class="p">]</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
</pre></div>
<p>You can almost read this statement as an <code>SQL</code> query:
<code>SELECT COUNT(year_month) FROM ufo_us GROUP BY us_state, year_month</code>.</p>
<p>The <code>groupby</code> method applied to the data frame results in a
<code>DataFrameGroupBy</code> object, which isn’t much to look at but contains all
the information we need to perform calculations by groups of the
variables we passed to the method. Calling the <code>year_month</code> column
results in a similar <code>SeriesGroupBy</code> object. Finally, calling the
<code>count</code> method counts how many non-null observations of <code>year_month</code>
there are in each level. The final output is a Series of the counts with
a hierarchical index of the groupby variables.</p>
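<p>Here is the same pattern on a toy frame, as a sketch for Python 3 and a recent Pandas. The data is made up, and I use <code>size()</code>, which counts rows per group, rather than <code>count()</code> on a column, which counts non-null values; with no missing values the two agree.</p>

```python
import pandas as pd

# Made-up sightings: three in Alaska, one in Alabama.
toy = pd.DataFrame({
    "us_state": ["ak", "ak", "ak", "al"],
    "year_month": ["1990-01", "1990-01", "1990-03", "1990-01"],
})

# Group on both keys, then count the rows that fall in each group.
grouped = toy.groupby(["us_state", "year_month"])
counts = grouped.size()
```

<p>The result is a series with a two-level index. Note that the pair <code>("al", "1990-03")</code> doesn't appear at all, because no row in the data has that combination.</p>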
<p>To aggregate their data in R, the authors use the <code>ddply</code> function,
which provides similar groupby-type functionality. I find the <code>plyr</code>
functions less intuitive and expressive than Pandas’ syntax. But, the
<code>plyr</code> functions are a big improvement over R’s <code>apply</code> functions for
complicated calculations.</p>
<p>As the authors do on p. 22, let’s check out the first few Alaska sightings.</p>
<div class="highlight"><pre><span class="k">print</span> <span class="s">'First few AK sightings in data:'</span>
<span class="k">print</span> <span class="n">sightings_counts</span><span class="o">.</span><span class="n">ix</span><span class="p">[</span><span class="s">'ak'</span><span class="p">]</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">6</span><span class="p">)</span>
</pre></div>
<p>This spits out:</p>
<div class="highlight"><pre>First few AK sightings in data:
year_month
1990-01-01 1
1990-03-01 1
1990-05-01 1
1993-11-01 1
1994-02-01 1
1994-11-01 1
</pre></div>
<p>Note that I have one more observation than the authors do–February 1994.
As discussed in <a href="../ml4h-ch1-p2.html">Part 2</a>, the authors’ cleaning methodology is going
to cut any observations where the <span class="caps">U.S.</span> city part of the location data
has commas in it. My methodology won’t lose those observations. That
seems to be what’s happened here. Looking at that record with:</p>
<div class="highlight"><pre>print 'Extra AK sighting, not on p. 22:'
print ufo_us[(ufo_us['us_state'] == 'ak') &
(ufo_us['year_month'] == dt.date(1994, 2, 1))] \
[['year_month','location']]
</pre></div>
<p>shows that indeed, my extra observation has a comma in the city record:</p>
<div class="highlight"><pre>Extra AK sighting, not on p. 22:
year_month location
5508 1994-02-01 Savoonga,St. Lawrence Island, AK
</pre></div>
<h2>Indexing tricks</h2>
<p>When we perform the <code>groupby</code> calculations, the resulting series is
missing rows where there were no <span class="caps">UFO</span> sightings in a state/month. This
makes sense of course – <code>groupby</code> goes through the data, finds all the
state/month combinations, and turns them into discrete levels within
which to perform calculations. If there are no sightings in a state in a
month, <code>groupby</code> won’t know to turn that combination into a level.</p>
<p>So, basically, we want to add those levels back into the data and set
the associated sightings count to zero. There are two ways to do this in
Pandas. The first uses Pandas’ <code>reindex</code> methods. I’ll create a “full”
index with every combination of states and months:</p>
<div class="highlight"><pre><span class="n">ym_list</span> <span class="o">=</span> <span class="p">[</span><span class="n">dt</span><span class="o">.</span><span class="n">date</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">m</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="k">for</span> <span class="n">y</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1990</span><span class="p">,</span> <span class="mi">2011</span><span class="p">)</span>
<span class="k">for</span> <span class="n">m</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">13</span><span class="p">)</span>
<span class="k">if</span> <span class="n">dt</span><span class="o">.</span><span class="n">date</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">m</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="o"><=</span> <span class="n">dt</span><span class="o">.</span><span class="n">date</span><span class="p">(</span><span class="mi">2010</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">1</span><span class="p">)]</span>
<span class="n">full_index</span> <span class="o">=</span> <span class="nb">zip</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="n">us_states</span> <span class="o">*</span> <span class="nb">len</span><span class="p">(</span><span class="n">ym_list</span><span class="p">)),</span> <span class="n">ym_list</span> <span class="o">*</span> <span class="nb">len</span><span class="p">(</span><span class="n">us_states</span><span class="p">))</span>
<span class="n">full_index</span> <span class="o">=</span> <span class="n">MultiIndex</span><span class="o">.</span><span class="n">from_tuples</span><span class="p">(</span><span class="n">full_index</span><span class="p">,</span> <span class="n">names</span> <span class="o">=</span>
<span class="p">[</span><span class="s">'states'</span><span class="p">,</span> <span class="s">'year_month'</span><span class="p">])</span>
</pre></div>
<p>The first line is just a list comprehension that creates a list of all
the months in the data, from January 1990 to August 2010. The second
line creates 51×248 tuples of (state, month) pairs. (I created the list
of states, <code>us_states</code>, in <a href="../ml4h-ch1-p2.html">Part 2</a>.) The third line creates a Pandas
hierarchical index out of these tuples. Hierarchical indices in Pandas
can take names that label the levels of the index.</p>
<p>Next, I’ll reindex the <code>sightings_counts</code> series with this full index.
Pandas will conform the dataset to the new index we give it, dropping
elements whose index level is not in the new index, and making elements
for new index levels not in the original. By default Pandas fills in
these new elements with <code>NA</code>, but we can tell it to fill these values
with zero, and end up with the series we’re looking for.</p>
<div class="highlight"><pre><span class="n">sightings_counts</span> <span class="o">=</span> <span class="n">sightings_counts</span><span class="o">.</span><span class="n">reindex</span><span class="p">(</span><span class="n">full_index</span><span class="p">,</span> <span class="n">fill_value</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>
</pre></div>
<h2>Stacking and unstacking data</h2>
<p>There’s another way to get the full time series out of the groupby
calculations. Instead of creating the full index of state/month
combinations, I can use a trick using Pandas’ <code>stack</code> and <code>unstack</code>
methods. <code>stack</code> and <code>unstack</code> turn data from wide to long and vice
versa, similar to the <code>melt</code> and <code>dcast</code> functions in R’s <code>reshape2</code>
package.</p>
<p>The idea is to first widen (<code>unstack</code>) the data, so that we have states
as rows and months as columns. This will force the data to have the
248×51 entries we’re looking for (assuming that there’s a sighting in
at least one state every month between January 1990 and August 2010).
For the entries in this data frame where there are no
sightings–state/months not present in the long data–Pandas will fill in
<code>NA</code>. I’ll tell Pandas to fill it with zero instead, and then <code>stack</code>
the data again to put it back in long form. Since there is now a number
(sometimes zero) for every state/month pair, this new long dataset will
have all the rows we need. Here’s the code:</p>
<div class="highlight"><pre><span class="n">sightings_counts1</span> <span class="o">=</span> <span class="n">ufo_us</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s">'us_state'</span><span class="p">,</span> <span class="s">'year_month'</span><span class="p">])[</span><span class="s">'year_month'</span><span class="p">]</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
<span class="n">sightings_counts1</span> <span class="o">=</span> <span class="n">sightings_counts1</span><span class="o">.</span><span class="n">unstack</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">stack</span><span class="p">()</span>
</pre></div>
<p>Let’s check that we get the same dataset from both methods:</p>
<div class="highlight"><pre><span class="c"># Check they're the same shape and values.</span>
<span class="k">print</span> <span class="s">'Shape using handmade MultiIndex:'</span><span class="p">,</span> <span class="n">sightings_counts</span><span class="o">.</span><span class="n">shape</span>
<span class="k">print</span> <span class="s">'Shape using unstack/stack method:'</span><span class="p">,</span> <span class="n">sightings_counts1</span><span class="o">.</span><span class="n">shape</span>
<span class="k">print</span> <span class="s">'Sum absolute difference:'</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">abs</span><span class="p">(</span><span class="n">sightings_counts1</span> <span class="o">-</span>
<span class="n">sightings_counts</span><span class="p">))</span>
</pre></div>
<p>I check the sum-of-absolute-differences between the series, instead of
checking for strict equality, to give some leeway for floating point
error (even though these should be integers, there might be some type
conversion that happens through these methods). Either way, looks like
we have the same result from both methods:</p>
<div class="highlight"><pre>Shape using handmade MultiIndex: (12648,)
Shape using unstack/stack method: (12648,)
Sum absolute difference: 0.0
</pre></div>
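<p>The same comparison can be run on a toy series, as a sketch for Python 3 and a recent Pandas with made-up counts and string months. It widens, fills, and restacks, then checks the result against a reindexed version. Newer Pandas releases may emit a warning about a changing <code>stack</code> implementation; the result here is the same either way because no missing values remain after <code>fillna</code>.</p>

```python
import pandas as pd

# Counts as groupby would return them: only pairs that occur in the data.
observed = pd.Series(
    [1, 2, 1],
    index=pd.MultiIndex.from_tuples(
        [("ak", "1990-01"), ("ak", "1990-03"), ("al", "1990-02")],
        names=["us_state", "year_month"],
    ),
)

# Widen: states as rows, months as columns, NaN where nothing was observed.
wide = observed.unstack("year_month")

# Fill the gaps with zero and go back to long form.
restacked = wide.fillna(0).stack()

# The reindex route, for comparison.
full_index = pd.MultiIndex.from_product(
    [["ak", "al"], ["1990-01", "1990-02", "1990-03"]],
    names=["us_state", "year_month"],
)
reindexed = observed.reindex(full_index, fill_value=0)

total_difference = (restacked - reindexed).abs().sum()
```

<p>The restacked values come back as floats, because the intermediate wide table had to hold <code>NaN</code>. That is the type conversion the sum-of-absolute-differences check is meant to tolerate.</p>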
<h2>Conclusion</h2>
<p>I’ve got the data just how I want it to plot time series of <span class="caps">UFO</span>
sightings by state. There were actually very few lines of code in this
part. But those few lines of code were doing a lot of work, and
represented one of the toughest parts of working with data: getting it
in the right shape. It wasn’t long ago that reshaping data was always
and everywhere a huge hassle. It still is in some languages (<em>*cough*
<span class="caps">SAS</span> *cough* Stata *cough*</em>). The combination of hierarchical
indexing and <code>stack</code> and <code>unstack</code> methods in Pandas makes doing this in
Python actually pleasant.</p>
<p>I’m finally going to wrap up Chapter 1 in the next part, in which I
create a plot to match the authors’ trellis plot of sightings time
series by state. It’s going to be a real Matplotlib adventure.</p>Shades of Time: I don’t buy it, and that’s why it’s so great.2012-04-22T04:00:00-04:00Carltag:slendermeans.org,2012-04-22:shades-of-time.html<p>Over the weekend <a href="http://drewconway.com/zia">Drew Conway</a> posted about a data analysis project
he’d just completed called <a href="http://www.drewconway.com/zia/?p=2874"><em>Shades of Time</em></a>. Very briefly, he took a
<a href="http://www.reddit.com/r/datasets/comments/s0fld/all_time_magazine_covers_march_1923_to_march_2012/">dataset</a> of Time magazine covers from 1923 to March 2012, then used
some Python libraries to identify the faces in the covers and identify
the skin tone of each face. The result is a really great
interactive <a href="http://labs.drewconway.com/time/">visualization</a> implemented in <a href="http://mbostock.github.com/d3/">d3.js</a>.</p>
<p>From looking at this data, Drew, with some caveats, observes that “it
does appear that the variance in skin tones have [sic] changed over
time, and in fact the tones are getting darker.” He also notes that
there are more faces on covers in later years.</p>
<h2>Why I don’t believe it</h2>
<p>There’s no real statistical testing done here–no formal quantification
of how skin-tone representation on covers is changing over time. Instead, I
think he’s drawing his conclusion from the visualization alone, especially
the scatterplot in the bottom panel that seems to show more darker tones
appearing later in the data (starting in the ’70s, the skin-tone
dispersion in his data starts to increase).</p>
<p>He notes that there are difficulties in both identifying faces and skin
tones. After going through his analysis, I think these algorithms are
fragile enough, and the categorization of faces and skin tones is poor
enough, that I don’t really buy his conclusion that cover face diversity
is increasing.</p>
<p>For example, I reviewed many of the data points classified with a dark skin
tone that seemed to be contributing to the visual impression of
increasing diversity. A good number of them weren’t faces at all, but
objects like guns, or parts of the word “<span class="caps">TIME</span>.”</p>
<p>Many others were famous white guys. Here’s a list I made from my cursory review:</p>
<ol>
<li>James Taylor (1971)</li>
<li>Archie Bunker/Carroll O’Connor (1973)</li>
<li>Joni Mitchell (1974)</li>
<li>Gerald Ford (1974, 1975)</li>
<li>Francisco Franco (1975)</li>
<li>Jimmy Carter (1976)</li>
<li>Queen Elizabeth (1976)</li>
<li>John Irving (1981)</li>
<li>Ronald Reagan (1985)</li>
<li>Willem Dafoe, Charlie Sheen (1987)</li>
<li>Ollie North (1987)</li>
<li>Dan Rather (1988)</li>
<li>Michael Eisner (1988)</li>
<li>Statue In Congress (1990)</li>
<li>Garth Brooks (1992)</li>
<li>Roger Keith Coleman (1992)</li>
<li>Serbian Detention Camp Prisoners (1992)</li>
<li>Michael Crichton (1995)</li>
<li>Bill Clinton (I know he’s the first black president, but I don’t think that should count) (1998)</li>
<li>Monica Lewinsky (1998)</li>
<li>John Travolta (1998)</li>
<li>Slobodan Milosevic (1999)</li>
<li>Ted Kaczynski (1999)</li>
<li>John McCain (2000)</li>
<li>Jerry Levin (2000)</li>
<li>George Bush (2000)</li>
<li>Francis Collins (2000)</li>
<li>Yoda (2002)</li>
<li>Trent Lott (2002)</li>
<li>Joe Wilson (2003)</li>
<li>Brad Pitt (2004)</li>
<li>John Kerry (2004)</li>
<li>George Bush (2004)</li>
<li>Bono (2006)</li>
<li>Bill Gates (2006)</li>
<li>Jesus Christ (2006)</li>
<li>John McCain (2006)</li>
<li>Rick Warren (2008)</li>
<li>Sarah Palin (2008)</li>
<li>Lloyd Blankfein (2009)</li>
<li>Tom Hanks (2010)</li>
<li>Jonathan Franzen (2010)</li>
<li>George Washington (2010)</li>
</ol>
<p>Now, no classification algorithm is perfect, and these covers are
complicated, heterogeneous inputs. But just from eyeballing it, this one
seems so inaccurate on this data, that I don’t trust that the observed
dispersion is the result of more correctly classified darker faces on covers.</p>
<h2>Why it’s still awesome</h2>
<p>While I don’t think the classification process here is accurate enough
to let us draw inferences about skin tone diversity, the fact that I
could come to this conclusion after 30 minutes of poking around on a web
site really says some interesting things about the process and
presentation of the project.</p>
<p>For one, I think it’s a fantastic use of dynamic visualization. I don’t
think any aesthetic aspect of it is novel or noteworthy; instead, I think
it’s innovative on a more meta level. Oftentimes we think of
visualizations as serving one of two processes. The first is pre-model:
exploration of raw data to suggest questions, patterns, or models. The
second is post-model: presentation of results or model diagnostics.</p>
<p>I’ve been skeptical of d3 and similar frameworks, because I’ve rarely
seen dynamic or interactive graphs that do a much better job at these
two types of tasks than static graphs. At least not so much better as to
justify the added costs of producing them and delivering them to an
audience. Also, a lot of what I’ve seen that’s been represented as cool
stuff you can do with d3–or Processing, or whatever–is mostly pretty
junk; stuff like busy stream graphs and chord graphs and other things
I’d put in the high-effort/low-reward quadrant of Kaiser Fung’s
<a href="http://statisticsforum.wordpress.com/2011/07/31/one-difference-between-statistical-graphics-and-infoviz-is-the-return-on-effort/">return-on-effort matrix</a>.</p>
<p>The visualization for <em>Shades of Time</em>, though, is impressive to me
because it’s not really exploring raw data, or presenting
results–instead it’s illustrating the <em>process</em> of analyzing the data.
To get the list above, I started from the time series chart at the
bottom that seemed to show increasing diversity. Then I noted the points
in that chart that I felt were most influencing that conclusion. I could
then find them in the scrolling chart on the left, click, see on the
right panel what raw data (what image on what cover) generated that
point, and determine whether the classifier was giving a meaningful result.</p>
<p>After going through it long enough, I decided there really wasn’t enough
meaningful output coming from the classifier for me to comfortably
believe Drew’s observation. Nonetheless, I think it’s incredibly novel
and useful to have a visualization that lets me so easily do a
mini-replication of the analysis. This one lets you walk through the
major steps, from raw data (the covers in the right panel) to
quantification/classification (the skin tone tiles in the left panel) to
aggregation and interpretation (the time series scatter plot on the bottom).</p>
<p>It really makes me rethink some of the possibilities of interactive
graphics. This isn’t just a <a href="http://www.nytimes.com/interactive/2008/02/23/movies/20080223_REVENUE_GRAPHIC.html">stream graph of box office receipts</a> or
the <a href="http://www.babynamewizard.com/">Baby Name Wizard</a>, which are mostly just raw data explorers. I
think it suggests a whole different application and conceptual framework
for interactive graphics. That is, how do we illustrate to an audience
the <em>process</em> by which we went from raw data to conclusions, and let
them follow along and investigate that process?</p>Machine Learning for Hackers Chapter 1, Part 3: Simple summaries and plots.2012-04-19T04:00:00-04:00Carltag:slendermeans.org,2012-04-19:ml4h-ch1-p3.html<h2>Introduction</h2>
<p>See <a href="../ml4h-ch1-p1.html">Part 1</a> and <a href="../ml4h-ch1-p2.html">Part 2</a> for previous work.</p>
<p>In this part, I’ll replicate the authors’ exploration of the <span class="caps">UFO</span>
sighting dates via histograms. The key takeaways:</p>
<ol>
<li>The plotting methods in Pandas are easy and useful.</li>
<li>Unlike R <code>Dates</code>, Python <code>datetimes</code> aren’t compatible with a lot of
mathematical operations. We’ll see that you can’t apply quantile or
histogram methods to them directly.</li>
</ol>
<h2>Quick data summary methods and datetime complications.</h2>
<p>For those playing along at home, I’m at p. 19 of the book. The first
thing the authors do here is get a statistical summary of the sighting
dates in the data, which are recorded in the <code>DateOccurred</code> variable
(which I’ve named <code>date_occurred</code> in my code). This is easy in R using
the <code>summary</code> function, which provides the minimum, maximum, and
quartiles of the data by default.</p>
<p>Pandas has similar functionality, in a method called <code>describe</code>, which
gives the same for numeric variables, plus the count of non-null values
and the mean and standard deviation. For example:</p>
<div class="highlight"><pre><span class="n">s1</span> <span class="o">=</span> <span class="n">Series</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">100</span><span class="p">))</span>
<span class="k">print</span> <span class="n">s1</span><span class="o">.</span><span class="n">describe</span><span class="p">()</span>
</pre></div>
<p>outputs what we’d expect from a series of randomly-generated standard normals:</p>
<div class="highlight"><pre>count 100.000000
mean -0.149274
std 1.011230
min -2.521374
25% -0.790867
50% -0.167813
75% 0.596617
max 2.231157
</pre></div>
<p>If we apply this to the <code>date_occurred</code> series, though, we get something different.</p>
<div class="highlight"><pre><span class="n">ufo_us</span><span class="p">[</span><span class="s">'date_occurred'</span><span class="p">]</span><span class="o">.</span><span class="n">describe</span><span class="p">()</span>
</pre></div>
<p>results in:</p>
<div class="highlight"><pre>count 52134
unique 8786
top 1999-11-16 00:00:00
freq 185
</pre></div>
<p>because Pandas treats <code>datetime</code> series as non-numeric variables (which
they technically are).</p>
<blockquote>
<p><strong>Note</strong>: To compute quantiles for numeric series, Pandas uses SciPy’s
<code>scoreatpercentile</code> function, which in turn relies on a simple linear
interpolation function (<code>_interpolate</code> in <code>scipy.stats</code>). <code>datetime</code>
objects don’t play well with this function: when you take the
difference between two <code>datetimes</code> you don’t get a number, but instead
a <code>timedelta</code> object, which the interpolation arithmetic can’t use
until you convert it to a plain number. The <code>min</code> and <code>max</code> methods will work on
<code>datetimes</code>, though.</p>
</blockquote>
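<p>As a rough illustration of the workaround that note implies, here is a minimal sketch that turns dates into plain integers before interpolating. It uses only the standard library and Python 3 syntax rather than Pandas or SciPy; the <code>date_quantile</code> helper and the sample dates are made up for illustration and are not from the book or from my chapter code.</p>

```python
import datetime as dt


def date_quantile(dates, q):
    """Linearly interpolated quantile of a list of dates, via ordinals."""
    # toordinal() maps each date to an integer day count, so ordinary
    # arithmetic (and therefore interpolation) works on the result.
    ordinals = sorted(d.toordinal() for d in dates)
    pos = q * (len(ordinals) - 1)          # fractional index of the quantile
    lo = int(pos)
    hi = min(lo + 1, len(ordinals) - 1)
    interpolated = ordinals[lo] + (pos - lo) * (ordinals[hi] - ordinals[lo])
    # Round to the nearest whole day and convert back to a date.
    return dt.date.fromordinal(int(round(interpolated)))


# Hypothetical sample of sighting dates, for illustration only.
sample = [dt.date(1995, 1, 1), dt.date(1999, 11, 16),
          dt.date(2003, 6, 1), dt.date(2007, 3, 15), dt.date(2010, 8, 1)]
median_date = date_quantile(sample, 0.5)
print(median_date)
```

<p>Converting to ordinals throws away the time of day, which is fine here since the sighting dates carry no time information.</p>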
<p>We can get around this by extracting the years from the variable, which
will be integers.</p>
<div class="highlight"><pre>years = ufo_us['date_occurred'].map(lambda x: x.year)
print years.describe()
</pre></div>
<p>results in:</p>
<div class="highlight"><pre>count 52134.000000
mean 2000.572237
std 10.889045
min 1400.000000
25% 1999.000000
50% 2003.000000
75% 2007.000000
max 2010.000000
</pre></div>
<p>which is a little too precise for year data, but how is Pandas to know? At
any rate, we come to the same conclusion as the authors: that three
quarters of the sightings occurred in 1999 or later, and the earliest
date in the data is in 1400. (If we check, we’ll see this sighting
occurred in Texas, so it’s certainly an error).</p>
<h2>Plotting histograms</h2>
<p>The authors then plot a histogram of the dates in the data. As with
<code>quantile</code>, the <code>hist</code> plot method (which just calls a Matplotlib
histogram) doesn’t work with <code>datetime</code> data. If we try</p>
<div class="highlight"><pre><span class="n">ufo_us</span><span class="p">[</span><span class="s">'date_occurred'</span><span class="p">]</span><span class="o">.</span><span class="n">hist</span><span class="p">()</span>
</pre></div>
<p>we’ll get an error complaining that <code>datetime</code> can’t be compared with
<code>float</code>. So, I’ll just work with the years instead of the full
<code>datetime</code>. I can generate the plot with a call to the series’ <code>hist</code>
method, one of several plotting methods for Pandas objects that makes it
extremely easy to get quick plots of them.</p>
<div class="highlight"><pre><span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">()</span>
<span class="n">years</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">bins</span> <span class="o">=</span> <span class="p">(</span><span class="n">years</span><span class="o">.</span><span class="n">max</span><span class="p">()</span> <span class="o">-</span> <span class="n">years</span><span class="o">.</span><span class="n">min</span><span class="p">())</span><span class="o">/</span><span class="mf">30.</span><span class="p">,</span> <span class="n">fc</span> <span class="o">=</span> <span class="s">'steelblue'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Histogram of years with U.S. UFO sightings</span><span class="se">\n</span><span class="s">All years in data'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">savefig</span><span class="p">(</span><span class="s">'quick_hist_all_years.png'</span><span class="p">)</span>
</pre></div>
<p>I explicitly set the bins to match the ggplot defaults used in the book.
We get this plot, which basically matches the authors’:</p>
<p><a href="../images/quick_hist_all_years2.png">
<img src="../images/quick_hist_all_years2.png" width=450px />
</a></p>
<p>The authors then focus on only data after 1990, using R’s <code>subset</code>
function to remove earlier observations from the data. This is
straightforward in Pandas. I’ll also extract another series with the
years of this subset of dates.</p>
<div class="highlight"><pre><span class="n">ufo_us</span> <span class="o">=</span> <span class="n">ufo_us</span><span class="p">[</span><span class="n">ufo_us</span><span class="p">[</span><span class="s">'date_occurred'</span><span class="p">]</span> <span class="o">>=</span> <span class="n">dt</span><span class="o">.</span><span class="n">datetime</span><span class="p">(</span><span class="mi">1990</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)]</span>
<span class="n">years_post90</span> <span class="o">=</span> <span class="n">ufo_us</span><span class="p">[</span><span class="s">'date_occurred'</span><span class="p">]</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">year</span><span class="p">)</span>
</pre></div>
<p>After subsetting, the authors have 46,347 rows left in the data. Looking
at the <code>shape</code> attribute of the subsetted data frame, we have 46,780.
We’ve picked up some observations from D.C., as well as from our more
expansive method of finding <span class="caps">U.S.</span> locations.</p>
<p>Another histogram of the subset data looks similar to the authors’ chart
on p. 23, but since I’m only histogramming over years, I lose some resolution.</p>
<p><a href="../images/quick_hist_post90.png">
<img src="../images/quick_hist_post90.png" width=450px />
</a></p>
<p>While the histogram is fine for a quick look at the distribution of
dates, it’s not a very accurate picture of how sightings evolve over
time: the binning really destroys too much information. It makes more
sense just to do a time-series plot of total sightings by date. We can
do that with some data aggregation and an easy call to the <code>plot</code> method
in Pandas.</p>
<div class="highlight"><pre><span class="n">post90_count</span> <span class="o">=</span> <span class="n">ufo_us</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'date_occurred'</span><span class="p">)[</span><span class="s">'date_occurred'</span><span class="p">]</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">()</span>
<span class="n">post90_count</span><span class="o">.</span><span class="n">plot</span><span class="p">()</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Number of U.S. UFO sightings</span><span class="se">\n</span><span class="s">January 1990 through August 2010'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">savefig</span><span class="p">(</span><span class="s">'post90_count_ts.png'</span><span class="p">)</span>
</pre></div>
<p>This uses Pandas’ awesome <code>groupby</code> method, which I’ll discuss more in
the next part. We get the following figure:</p>
<p><a href="../images/post90_count_ts.png">
<img src="../images/post90_count_ts.png" width=450px />
</a></p>
<p>Based on this graph, it looks like there’s a seasonal component to
sightings, which wasn’t apparent in the histogram. There are also a few
large spikes, especially around the end of the millennium.</p>
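<p>For readers who want to see what the per-date count amounts to without Pandas, here is a small sketch using only the standard library (Python 3 syntax). The dates are invented stand-ins for the <code>date_occurred</code> column, and <code>Counter</code> is my substitute for <code>groupby</code>, not what the chapter code uses.</p>

```python
from collections import Counter
import datetime as dt

# Hypothetical sighting dates standing in for the date_occurred column.
sightings = [dt.date(1999, 11, 16), dt.date(1999, 11, 16),
             dt.date(1999, 11, 17), dt.date(2004, 7, 4),
             dt.date(2004, 7, 4), dt.date(2004, 7, 4)]

# Counter groups equal dates together and counts them, which is the same
# aggregation as grouping rows by date and counting each group.
counts_by_date = Counter(sightings)

# Sort by date so the result reads as a time series.
for day, n in sorted(counts_by_date.items()):
    print(day, n)
```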
<h2>Conclusion</h2>
<p>This part was a relatively easy one. The next part will focus on data
aggregation using <code>groupby</code> and <code>reindex</code> methods. Then I’ll wrap up
by replicating the authors’ trellis graph.</p>Machine Learning for Hackers Chapter 1, Part 2: Cleaning date and location data2012-04-18T04:00:00-04:00Carltag:slendermeans.org,2012-04-18:ml4h-ch1-p2.html<h2>Introduction</h2>
<p>In the <a href="../ml4h-ch1-p1.html">previous post</a>, I loaded the raw <span class="caps">UFO</span> data into a Pandas data
frame after cleaning up some irregularities in the text file. Since
we’re ultimately concerned with analyzing <span class="caps">UFO</span> sightings over time and
space, the next step is to clean those variables and prepare them for
analysis and visualization.</p>
<p>Some Python techniques to note in this part are:</p>
<ul>
<li>As in the last part, Python string methods are going to come in
really handy, and be a simple, expressive solution to a lot of problems.</li>
<li>When those aren’t enough, Python has a pretty straightforward set of
functions for implementing regular expressions.</li>
<li>The <code>map()</code> method in Pandas can be used to “vectorize” functions
along a Series (i.e. a data frame column) and is similar to R’s
<code>apply</code>. In general, using a NumPy <code>ufunc</code> (vectorized function) is
preferable, but not all operations can be expressed in <code>ufunc</code>s.
This is especially true for non-numeric operations, such as for
strings or dates.</li>
</ul>
<h2>Cleaning dates: mapping and subsetting.</h2>
<p>The first two columns of the data are dates in <code>YYYYMMDD</code> format, and
Pandas imported them as integers. R has a function, <code>as.Date</code>, that will
operate on a vector of date strings, converting them to numeric dates.
In Python, the <code>strptime</code> function in the <code>datetime</code> module performs the
same function, but it is not vectorized the way <code>as.Date</code> is. (Note that R
also has a <code>strptime</code> that converts date strings to <span class="caps">POSIX</span>-class objects.)
Therefore, we have to use the <code>map</code> method.</p>
<div class="highlight"><pre><span class="k">def</span> <span class="nf">ymd_convert</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
    <span class="sd">'''</span>
    <span class="sd">Convert dates in the imported UFO data.</span>
    <span class="sd">Clean entries will look like YYYYMMDD. If they're not clean, return NA.</span>
    <span class="sd">'''</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">cnv_dt</span> <span class="o">=</span> <span class="n">dt</span><span class="o">.</span><span class="n">datetime</span><span class="o">.</span><span class="n">strptime</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">x</span><span class="p">),</span> <span class="s">'%Y%m</span><span class="si">%d</span><span class="s">'</span><span class="p">)</span>
    <span class="k">except</span> <span class="ne">ValueError</span><span class="p">:</span>
        <span class="n">cnv_dt</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">nan</span>
    <span class="k">return</span> <span class="n">cnv_dt</span>

<span class="n">ufo</span><span class="p">[</span><span class="s">'date_occurred'</span><span class="p">]</span> <span class="o">=</span> <span class="n">ufo</span><span class="p">[</span><span class="s">'date_occurred'</span><span class="p">]</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="n">ymd_convert</span><span class="p">)</span>
<span class="n">ufo</span><span class="p">[</span><span class="s">'date_reported'</span><span class="p">]</span> <span class="o">=</span> <span class="n">ufo</span><span class="p">[</span><span class="s">'date_reported'</span><span class="p">]</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="n">ymd_convert</span><span class="p">)</span>
</pre></div>
<p>Notice that <code>map</code> here is like R’s <code>apply</code> function (this is a little
confusing, since Python also has an <code>apply</code> method that is <em>not</em> like
R’s). Since series—columns in Pandas data frames—are just NumPy
<code>ndarrays</code> underneath, only NumPy <code>ufuncs</code> will operate on them in a
vectorized (fast, elementwise) fashion. Base Python functions, and any
more complicated functions you create from them, will have to be
explicitly mapped. This is a little different from R, where, since the
fundamental object in the language is the vector, functions are more
likely vectorized than not. Nonetheless, NumPy <code>ufuncs</code> do cover the
gamut of mathematical operations, and for other cases, the <code>map</code> method
is easy enough to implement.</p>
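<p>To make the “explicitly mapped” idea concrete outside of Pandas, here is a minimal standard-library sketch in Python 3 syntax. The <code>ymd_to_date</code> name and the sample values are hypothetical, and it returns <code>None</code> for bad entries where my real function returns <code>np.nan</code>, so treat it as an illustration rather than the chapter code.</p>

```python
import datetime as dt


def ymd_to_date(x):
    """Parse a YYYYMMDD integer or string; return None if it does not parse."""
    try:
        return dt.datetime.strptime(str(x), '%Y%m%d')
    except ValueError:
        return None


# Hypothetical raw values: two clean entries and one malformed one.
raw = [19951009, '20100830', 0]

# strptime handles one value at a time, so it has to be applied
# element by element; this list comprehension plays the role of map.
parsed = [ymd_to_date(x) for x in raw]
print(parsed)
```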
<p>Then we just get rid of the rows with one date or the other not in
proper <code>YYYYMMDD</code> format.</p>
<div class="highlight"><pre><span class="c"># Get rid of the rows that couldn't be conformed to datetime.</span>
<span class="n">ufo</span> <span class="o">=</span> <span class="n">ufo</span><span class="p">[(</span><span class="n">notnull</span><span class="p">(</span><span class="n">ufo</span><span class="p">[</span><span class="s">'date_reported'</span><span class="p">]))</span> <span class="o">&</span>
<span class="p">(</span><span class="n">notnull</span><span class="p">(</span><span class="n">ufo</span><span class="p">[</span><span class="s">'date_occurred'</span><span class="p">]))]</span>
</pre></div>
<p>The subsetting of the data frame is done by indexing it with a boolean
vector: with a boolean mask, the <code>df[ ]</code> operation returns the rows where the mask is true.</p>
<p>One can also subset an R data frame this way. R, though, also has a
<code>subset</code> function, which makes the boolean-indexing form:</p>
<div class="highlight"><pre>ufo = ufo[!is.na(ufo[ , 'date.reported']) & !is.na(ufo[ , 'date.occurred']), ]
</pre></div>
<p>equivalent to:</p>
<div class="highlight"><pre>ufo <span class="o">=</span> <span class="kp">subset</span><span class="p">(</span>ufo<span class="p">,</span> subset <span class="o">=</span> <span class="o">!</span><span class="kp">is.na</span><span class="p">(</span>date.reported<span class="p">)</span> <span class="o">&</span> <span class="o">!</span><span class="kp">is.na</span><span class="p">(</span>date.occurred<span class="p">))</span>
</pre></div>
<p>The general <code>subset</code> syntax is:
<code>df.new = subset(df.orig, subset = condition, select = columns)</code>.
Since <code>subset</code> looks for the variables referenced in the <code>subset</code> and
<code>select</code> arguments in the <code>df.orig</code> environment, there’s no need to call
them as <code>df.orig[ , 'var']</code> or <code>df.orig$var</code>. There are other useful
commands that work like this: <code>with</code>, <code>within</code>, and <code>transform</code>, for example.</p>
<p>I find the <code>subset</code> function in R more expressive and easier to read
than the boolean masking method, and I miss there being a Pandas equivalent.</p>
<h2>Cleaning locations: string functions and regular expressions</h2>
<p>Cleaning the date variables was relatively easy. Locations are trickier,
and the authors don’t do a particularly thorough job of it. (No knock on
them: reading several pages of text cleaning would be deadly boring, and
they’re just illustrating some techniques.) I’ll suggest a slightly
better method that will pick up some extra data, but even that could
probably be improved if we were concerned about getting every bit of
information out of this dataset.</p>
<p>The authors assume that valid <span class="caps">U.S.</span> locations are going to be in “City,
<span class="caps">ST</span>” format (e.g., “Iowa City, <span class="caps">IA</span>”). Anything else is going to be dropped
as either an international record, or not worth cleaning.</p>
<p>They write a function that takes a location record and checks that it
fits this pattern by seeing if R’s <code>strsplit</code> function splits it into
two elements at a comma. If so, the function returns a vector containing
the two elements, otherwise it returns a vector with two <code>NAs</code> (though
not quite, see the note below). They then use R’s <code>lapply</code> to apply the
function elementwise, and collect the resulting vectors in a list. Then
there are some tricks to get the list into an <code>Nx2</code> matrix, and then put
each column of the matrix into a variable in the data frame as <code>USCity</code>
and <code>USState</code>.</p>
<blockquote>
<p><strong>Note</strong>: the authors wrap <code>strsplit</code> in <code>tryCatch</code> assuming that the
former will throw an error if there are no commas in the string. My
testing shows that’s not the case, and <code>strsplit</code> will just return the
original string. The <code>tryCatch</code> wrapper doesn’t have any effect, and
that line of code doesn’t appear to drop locations without commas as
the authors intend. This isn’t really a problem, since they later
subset on records with valid <span class="caps">U.S.</span> states, and that ultimately drops
the no-comma location records.</p>
</blockquote>
<p>It’s easy to write a similar function in Python, using the <code>split</code>
method of string objects.</p>
<div class="highlight"><pre><span class="k">def</span> <span class="nf">get_location</span><span class="p">(</span><span class="n">l</span><span class="p">):</span>
    <span class="n">split_location</span> <span class="o">=</span> <span class="n">l</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s">','</span><span class="p">)</span>
    <span class="n">clean_location</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">split_location</span><span class="p">]</span>
    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">split_location</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">2</span><span class="p">:</span>
        <span class="n">clean_location</span> <span class="o">=</span> <span class="p">[</span><span class="s">''</span><span class="p">,</span> <span class="s">''</span><span class="p">]</span>
    <span class="k">return</span> <span class="n">clean_location</span>
</pre></div>
<p>This is a near-direct translation of the authors’ <code>get.location</code> function.
Note the <code>strip</code> method and the list comprehension replace the <code>gsub</code>
function the authors use to remove leading and trailing whitespace
from the extracted city and states.</p>
<p>But a quick look at the data shows that there are lots of valid <span class="caps">U.S.
</span>locations that will get dropped with this method. Specifically, the city
part of the location contains commas in many records, so the <code>split</code>
method will return more than two elements and we will drop them as
invalid. Let’s check out some cases with the following code:</p>
<div class="highlight"><pre><span class="n">multi_commas</span> <span class="o">=</span> <span class="n">ufo</span><span class="p">[</span><span class="s">'location'</span><span class="p">]</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span> <span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s">','</span><span class="p">)</span> <span class="o">></span> <span class="mi">1</span><span class="p">)</span>
<span class="k">print</span> <span class="s">'Number of entries w/ multiple commas'</span><span class="p">,</span> <span class="nb">sum</span><span class="p">(</span><span class="n">multi_commas</span><span class="p">)</span>
<span class="k">print</span> <span class="n">ufo</span><span class="p">[</span><span class="s">'location'</span><span class="p">][</span><span class="n">multi_commas</span><span class="p">][:</span><span class="mi">10</span><span class="p">]</span>
</pre></div>
<p>This returns:</p>
<div class="highlight"><pre>Number of entries w/ multiple commas 1055
1473 Aquaduct (near, over desert, before entering California), CA
1985 Redding (northeast of, out over Millville, approximately), CA
2108 Farmington (SE of, deserted area, Hwy 44), NM
2160 Stouthill (community, nearest city 30 miles, TN), TN
2242 Highway 71 between Clearmont, Missouri and Maryville, Missou, MO
2257 Bayfield (near, Lake Superior, south shore), WI
2287 Unidentified object sig, (VIC, Australia),
2297 Garfield, (VIC, Australia),
2384 Northeast Cape AFS, St Lawrence Island,, AK
2458 Flisa, Solør, Hedemark (Norway),
</pre></div>
<p>So there are over a thousand location records with more than one comma,
and out of the first ten, seven are valid <span class="caps">U.S.</span> locations.</p>
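<p>To see the problem in miniature, here is a small sketch of the comma-splitting approach applied to one clean location and one of the multi-comma records listed above. It is standard-library Python 3, and <code>split_on_comma</code> is a hypothetical stand-in for <code>get_location</code>, not a copy of it.</p>

```python
def split_on_comma(location):
    """Return [city, state] only when the location has exactly one comma."""
    parts = [p.strip() for p in location.split(',')]
    return parts if len(parts) == 2 else ['', '']


# One clean "City, ST" record and one multi-comma record from the listing.
simple = split_on_comma('Iowa City, IA')
messy = split_on_comma('Farmington (SE of, deserted area, Hwy 44), NM')
print(simple)
print(messy)
```

<p>The New Mexico record is a perfectly good U.S. location, but it comes back as two blanks because the split produces four pieces instead of two.</p>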
<p>To save these records, I’ll try another method, using regular
expressions to search for locations that end with “, <span class="caps">ST</span>”-type patterns.
Since we’re going to ultimately use <code>map</code> to check this pattern for
every row in the data, I’ll <em>compile</em> the pattern first, which typically
speeds up repeated searches.</p>
<div class="highlight"><pre><span class="n">us_state_pattern</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s">', [A-Z][A-Z]$'</span><span class="p">,</span> <span class="n">re</span><span class="o">.</span><span class="n">IGNORECASE</span><span class="p">)</span>
</pre></div>
<p>Then, I’ll create a function that takes a location record as input, and
applies the regex search to it.</p>
<div class="highlight"><pre><span class="k">def</span> <span class="nf">get_location2</span><span class="p">(</span><span class="n">l</span><span class="p">):</span>
    <span class="n">strip_location</span> <span class="o">=</span> <span class="n">l</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>
    <span class="n">us_state_search</span> <span class="o">=</span> <span class="n">us_state_pattern</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">strip_location</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">us_state_search</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
        <span class="n">clean_location</span> <span class="o">=</span> <span class="p">[</span><span class="s">''</span><span class="p">,</span> <span class="s">''</span><span class="p">]</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">us_city</span> <span class="o">=</span> <span class="n">strip_location</span><span class="p">[</span> <span class="p">:</span><span class="n">us_state_search</span><span class="o">.</span><span class="n">start</span><span class="p">()]</span>
        <span class="n">us_state</span> <span class="o">=</span> <span class="n">strip_location</span><span class="p">[</span><span class="n">us_state_search</span><span class="o">.</span><span class="n">start</span><span class="p">()</span> <span class="o">+</span> <span class="mi">2</span><span class="p">:</span> <span class="p">]</span>
        <span class="n">clean_location</span> <span class="o">=</span> <span class="p">[</span><span class="n">us_city</span><span class="p">,</span> <span class="n">us_state</span><span class="p">]</span>
    <span class="k">return</span> <span class="n">clean_location</span>
</pre></div>
<p>To follow this, note that if the regex pattern isn’t found, then the
<code>search</code> method returns <code>None</code>; otherwise it returns a match object
with several useful methods. One of them is <code>start()</code>, which gives the
position in the string where the pattern starts. To extract the city, we just
take all the characters in the string up to that position. The state will
start 2 characters later (since we don’t want the comma or space in
front). The function, like the previous one, finally returns a two-element
list with either a city and a state, or two blanks for records
that didn’t match the pattern.</p>
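<p>Here is a self-contained sketch of that logic run on a few sample strings, so the slicing around <code>start()</code> is easier to follow. It is written in Python 3 syntax with only the <code>re</code> module; the <code>split_city_state</code> and <code>state_pattern</code> names are mine for this example, and the sample strings come from the text and listing above.</p>

```python
import re

# Same idea as the pattern in the post: ", " plus two letters at the end.
state_pattern = re.compile(r', [A-Z][A-Z]$', re.IGNORECASE)


def split_city_state(location):
    """Return [city, state], or ['', ''] when no trailing ', ST' is found."""
    stripped = location.strip()
    match = state_pattern.search(stripped)
    if match is None:
        return ['', '']
    # match.start() is the index of the comma; the state begins 2 chars later.
    return [stripped[:match.start()], stripped[match.start() + 2:]]


examples = ['Iowa City, IA',
            'Bayfield (near, Lake Superior, south shore), WI',
            'Garfield, (VIC, Australia),']
results = [split_city_state(loc) for loc in examples]
for loc, res in zip(examples, results):
    print(loc, '->', res)
```

<p>The Wisconsin record now survives despite its extra commas, while the Australian one still falls through because it doesn’t end in a two-letter code.</p>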
<p>I again use <code>map</code> to apply this function elementwise to the location column:</p>
<div class="highlight"><pre><span class="n">location_lists</span> <span class="o">=</span> <span class="n">ufo</span><span class="p">[</span><span class="s">'location'</span><span class="p">]</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="n">get_location2</span><span class="p">)</span>
</pre></div>
<p>This returns a series of two-element lists. I use list comprehensions to
extract the first and second elements out to individual lists, which I
assign to <code>us_city</code> and <code>us_state</code> variables in the data frame. It
sounds complicated, but in Python it’s just two fairly readable lines of code:</p>
<div class="highlight"><pre><span class="n">ufo</span><span class="p">[</span><span class="s">'us_city'</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="n">city</span> <span class="k">for</span> <span class="n">city</span><span class="p">,</span> <span class="n">st</span> <span class="ow">in</span> <span class="n">location_lists</span><span class="p">]</span>
<span class="n">ufo</span><span class="p">[</span><span class="s">'us_state'</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="n">st</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span> <span class="k">for</span> <span class="n">city</span><span class="p">,</span> <span class="n">st</span> <span class="ow">in</span> <span class="n">location_lists</span><span class="p">]</span>
</pre></div>
<p>The last step in cleaning the location data is to weed out any locations
that fit the “City, <span class="caps">ST</span>” pattern, but were not in <span class="caps">U.S.</span> states–Canadian
provinces for example. The authors do this in a straightforward way by
making a list of the 50 <span class="caps">U.S.</span> states and using R’s <code>match</code> function to
see where the <span class="caps">U.S.</span> state variable matches a state in the list. They then
subset the data frame to records where there is a match.</p>
<blockquote>
<p><strong>Note</strong>: The authors leave <span class="caps">D.C.</span> out of the list of states. It looks
like there are about 90 records with <span class="caps">D.C.</span> in the state column.
Unfortunately a couple of these aren’t Washington, D.C., but are South
American “Distrito Capitals.” I’ll add <span class="caps">D.C.</span> into the list and
subsequent analyses, keeping in mind there are a few false positives.
(This may be true for other states as well. As I said at the start,
this cleaning isn’t 100% accurate.)</p>
</blockquote>
<p>NumPy has an equivalent to the <code>match</code> function, though the name is a
little more awkward: <code>in1d</code>. Below, I assign a blank string to any records in
<code>us_state</code> that don’t have a match in the state list,
then drop them out of the data.</p>
<div class="highlight"><pre><span class="n">ufo</span><span class="p">[</span><span class="s">'us_state'</span><span class="p">][</span><span class="o">~</span><span class="n">np</span><span class="o">.</span><span class="n">in1d</span><span class="p">(</span><span class="n">ufo</span><span class="p">[</span><span class="s">'us_state'</span><span class="p">]</span><span class="o">.</span><span class="n">tolist</span><span class="p">(),</span> <span class="n">us_states</span><span class="p">)]</span> <span class="o">=</span> <span class="s">''</span>
<span class="n">ufo</span><span class="p">[</span><span class="s">'us_city'</span><span class="p">][</span><span class="o">~</span><span class="n">np</span><span class="o">.</span><span class="n">in1d</span><span class="p">(</span><span class="n">ufo</span><span class="p">[</span><span class="s">'us_state'</span><span class="p">]</span><span class="o">.</span><span class="n">tolist</span><span class="p">(),</span> <span class="n">us_states</span><span class="p">)]</span> <span class="o">=</span> <span class="s">''</span>
<span class="n">ufo_us</span> <span class="o">=</span> <span class="n">ufo</span><span class="p">[</span><span class="n">ufo</span><span class="p">[</span><span class="s">'us_state'</span><span class="p">]</span> <span class="o">!=</span> <span class="s">''</span><span class="p">]</span>
</pre></div>
<p>The <code>tolist</code> call hands <code>in1d</code> a plain Python list built from the series;
<code>in1d</code> then returns a NumPy boolean array, which is used as the mask inside <code>[ ]</code>.</p>
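<p>For anyone who wants the membership test spelled out without NumPy, here is a rough standard-library equivalent in Python 3 syntax. The abbreviated <code>us_states</code> set and the <code>state_column</code> values are invented for the example; the real list has all 50 states plus D.C.</p>

```python
# Hypothetical, abbreviated state list; the real one has 50 states plus 'dc'.
us_states = {'ia', 'tx', 'wi', 'dc'}

# Hypothetical extracted state codes, including a Canadian province ('on')
# and a blank from a record that did not match the "City, ST" pattern.
state_column = ['ia', 'on', 'wi', '', 'dc']

# Elementwise membership test: True where the code is a known U.S. state.
is_us_state = [st in us_states for st in state_column]

# Keep only the entries whose mask value is True.
kept = [st for st, keep in zip(state_column, is_us_state) if keep]
print(is_us_state)
print(kept)
```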
<p>And that’s that. In the next post I’ll start exploring the data graphically.</p>Machine Learning for Hackers Chapter 1, Part 1: Loading data2012-04-14T04:00:00-04:00Carltag:slendermeans.org,2012-04-14:ml4h-ch1-p1.html<h2>Preface</h2>
<p>This is my first <em>Will it Python?</em> post. These posts document
my experiences trying to port complete and interesting R projects to
Python. I’m beginning by going through the recently published <a href="http://shop.oreilly.com/product/0636920018483.do"><em>Machine
Learning for Hackers</em></a> (<span class="caps">MLFH</span>) by <a href="http://www.drewconway.com">Drew Conway</a> and <a href="http://johnmyleswhite.com">John Myles
White</a>.</p>
<p>More information on the posts is <a href="../pages/will-it-python.html">here</a>, and archives are <a href="../category/will-it-python.html">here</a>.</p>
<h2>Introduction</h2>
<p>The first chapter of <span class="caps">MLFH</span> is a gentle introduction to loading,
manipulating and graphing data in R. To keep the tutorial interesting,
the authors have found a fun dataset of <a href="https://github.com/johnmyleswhite/ML_for_Hackers/tree/master/01-Introduction/data/ufo"><span class="caps">UFO</span> sightings</a> to work through.</p>
<p>Since this chapter is mainly devoted to loading and manipulating data, a
lot of the R functionality they exploit is going to have an analog in
<a href="http://pandas.pydata.org/">Pandas</a>. Even though there’s not too much exciting going on in this
chapter, it’s a great way to explore how basic data tasks get done in
Python. It turns out there are some interesting differences between how
R and Python handle even this simple stuff.</p>
<p>In this first post, I’ll focus on just getting the data into the work
environment. The complete code for the chapter is located in a GitHub
repo, <a href="https://github.com/carljv/Will_it_Python/tree/master/MLFH/CH1">here</a>.</p>
<h2>Data with inconsistent column lengths: break or compensate?</h2>
<p>The raw data is contained in a tab-separated file and the authors use
R’s <code>read.delim()</code> function to read it into an R dataframe. The data seem
to load smoothly, and there are no errors or warnings. There are no
headers in the data, so the authors set the <code>header</code> argument
of <code>read.delim()</code> to <code>FALSE</code> and name the columns of the data frame after
it’s loaded.</p>
<p>The same procedure in Python uses the <code>read_table()</code> function in Pandas:</p>
<div class="highlight"><pre><span class="n">ufo</span> <span class="o">=</span> <span class="n">read_table</span><span class="p">(</span><span class="s">'data/ufo/ufo_awesome.tsv'</span><span class="p">,</span> <span class="n">sep</span> <span class="o">=</span> <span class="s">'</span><span class="se">\t</span><span class="s">'</span><span class="p">,</span>
<span class="n">na_values</span> <span class="o">=</span> <span class="s">''</span><span class="p">,</span> <span class="n">header</span> <span class="o">=</span> <span class="bp">None</span><span class="p">)</span>
</pre></div>
<p>This, though, raises an exception complaining about the
“wrong number of columns.” R loaded the data without complaint, so
what’s going on?</p>
<p>It turns out that <code>read_table()</code> is right to complain. Let’s use
Python’s basic file <span class="caps">IO</span> to read each line of the file, and separate the
line into columns by splitting it at tab characters. We’d expect each
line to have six columns. As soon as we hit a line that doesn’t, I’ll
break the line-reading loop, and print out the line number and the
columns it was split into. This will tell us where the first (if any)
bad line is in the file, and give a look at what’s wrong with it.</p>
<div class="highlight"><pre><span class="n">inpath</span> <span class="o">=</span> <span class="s">'data/ufo/ufo_awesome.tsv'</span>
<span class="n">inf</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="n">inpath</span><span class="p">,</span> <span class="s">'r'</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">line</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">inf</span><span class="p">):</span>
    <span class="n">splitline</span> <span class="o">=</span> <span class="n">line</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s">'</span><span class="se">\t</span><span class="s">'</span><span class="p">)</span>
    <span class="c"># Stop at the first line that doesn't split into six columns.</span>
    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">splitline</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">6</span><span class="p">:</span>
        <span class="n">first_bad_line</span> <span class="o">=</span> <span class="n">splitline</span>
        <span class="k">print</span> <span class="s">"First bad row:"</span><span class="p">,</span> <span class="n">i</span>
        <span class="k">for</span> <span class="n">j</span><span class="p">,</span> <span class="n">col</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">first_bad_line</span><span class="p">):</span>
            <span class="k">print</span> <span class="n">j</span><span class="p">,</span> <span class="n">col</span>
        <span class="k">break</span>
<span class="n">inf</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
</pre></div>
<p>This code prints the following output:</p>
<div class="highlight"><pre>First bad row: 754
0 19950704
1 19950706
2 Orlando, FL
3
4 4-5 min
5 I would like to report three yellow oval lights which passed over
Orlando,Florida on July 4, 1995 at aproximately 21:30 (9:30 pm). These
were the sizeof Venus (which they passed close by). Two of them traveled
one after the otherat exactly the same speed and path heading
south-southeast. The third oneappeared about a minute later following
the same path as the other two. Thewhole sighting lasted about 4-5
minutes. There were 4 other witnesses oldenough to report the sighting.
My 4 year old and 5 year old children were theones who called my
attention to the &quot;moving stars&quot;. These objects moved
fasterthan an airplane and did not resemble an aircraft, and were moving
much slowerthan a shooting star. As for them being fireworks, their path
was too regularand coordinated. If anybody else saw this phenomenon,
please contact me at:
6 ler@gnv.ifas.ufl.edu
</pre></div>
<p>So we see that in row 754 of the file, we came across a line with seven
columns (six tabs). The sixth column of the data is a “long” description
of the <span class="caps">UFO</span> sighting, and here it looks like there was a tab character
within the long description, creating extraneous columns.</p>
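<p>The loop above stops at the first bad line. To get a rough sense of how
widespread the problem is, one could tally how many lines split into each
number of columns. The following is only a sketch, written for Python 3;
the helper name <code>tally_field_counts</code> and the sample records are
made up for illustration and are not part of the original analysis.</p>

```python
from collections import Counter


def tally_field_counts(lines, sep='\t'):
    """Count how many lines split into each number of fields."""
    return Counter(len(line.rstrip('\n').split(sep)) for line in lines)


# Made-up records standing in for lines of the UFO file.
sample = [
    "19950704\t19950706\tOrlando, FL\t\t4-5 min\tThree yellow lights\n",
    "19950704\t19950706\tOrlando, FL\t\t4-5 min\tcontact me at:\tsomeone@example.com\n",
    "a truncated record with no tabs\n",
]

counts = tally_field_counts(sample)
for n_fields, n_lines in sorted(counts.items()):
    print(n_fields, n_lines)
```

<p>Because the helper accepts any iterable of lines, an open file handle for
<code>data/ufo/ufo_awesome.tsv</code> could be passed in place of the sample list.</p>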
<p>Why didn’t R have a problem with this line? We can see what happened if
we look on page 15 of <span class="caps">MLFH</span>. There the authors show rows of the data
where the first column (the date of the sighting) doesn’t match a date
format. The first instance of a bad observation in the first column of
the R data is <code>ler@gnv.ifas.ufl.edu</code>, which we just saw is actually the
first instance of a spurious seventh column. Apparently, <code>read.delim()</code>
is inferring the number of columns from the first few rows, then pushing
any extra columns to a new row.</p>
<p>I much prefer the Pandas behavior here to R’s. Even though R
actually did get the data loaded with no fuss, it ended up mangling it
pretty badly. Given the size of the dataset, the rarity of these bad
rows, and the authors’ cleaning process, it may not have mattered much
at the end of the analysis. But that’s not going to be true in every
case – and here, R isn’t even throwing a warning to indicate that
something might be fishy with the raw data.</p>
<p>Note though, that if the authors had used <code>read.delim()</code> with a
<code>col.names</code> argument, then R would have raised an error when it came
across a row with more columns than were indicated by the supplied list
of column names.</p>
<p>This is a pretty boring problem, but an important one. To sum up:</p>
<blockquote>
<p><strong>Lesson 1</strong>: R’s <code>read.delim()</code> without either <code>header = TRUE</code> or a
<code>col.names</code> argument is dangerous. If you have to load the data to
figure out what the column names should be, try loading it again with
the column names you’ve assigned.</p>
</blockquote>
<h2>Preparing the raw data to load into a data frame.</h2>
<p>Now that we’ve discovered irregularities in the raw data that are
preventing it from fitting neatly into a data frame, we have to fix them.</p>
<p>There are two options, and both involve processing the file line by line.
First, we can take the data in the columns after the sixth and append
them to the end of the data in the sixth column. The sixth column is a
long text description of the event, and the extra columns are likely to
be continuations of that description. But we don’t actually end up
caring about the long description in our analysis, so I’ll take the second
approach and just delete those extra columns.</p>
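<p>For comparison, the first option (folding the extra columns back into the
long description) might look something like the sketch below. It is written
for Python 3, the function name <code>merge_extra_columns</code> and the sample
line are hypothetical, and replacing the stray tabs with spaces is my own
assumption about how the pieces should be rejoined. The rest of this post
sticks with the second option.</p>

```python
def merge_extra_columns(line, n_cols=6, sep='\t'):
    """Fold any fields past the n_cols-th into the last column.

    Returns None for lines with fewer than n_cols fields.
    """
    # Limiting the number of splits leaves everything after the
    # (n_cols - 1)-th separator in the final field.
    fields = line.rstrip('\n').split(sep, n_cols - 1)
    if len(fields) < n_cols:
        return None
    # Stray separators inside the last field become spaces, so the
    # output line has exactly n_cols separator-delimited fields.
    fields[-1] = fields[-1].replace(sep, ' ')
    return sep.join(fields) + '\n'


# A made-up line with a tab inside the long description.
line = ("19950704\t19950706\tOrlando, FL\t\t4-5 min\t"
        "please contact me at:\tsomeone@example.com\n")
merged = merge_extra_columns(line)
print(merged.split('\t')[5])
```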
<p>The procedure is encapsulated in the function below. It reads lines from
the original file, <code>inpath</code>, cleans them, and writes the result to
<code>outpath</code>. Note that this function doesn’t return anything;
it works purely through its side effect of writing to the <code>outpath</code> file.</p>
<div class="highlight"><pre><span class="k">def</span> <span class="nf">ufotab_to_sixcols</span><span class="p">(</span><span class="n">inpath</span><span class="p">,</span> <span class="n">outpath</span><span class="p">):</span>
    <span class="sd">'''</span>
    <span class="sd">Keep only the first 6 columns of data from messy UFO TSV file.</span>
    <span class="sd">The UFO data set is only supposed to have six columns. But...</span>
    <span class="sd">The sixth column is a long written description of the UFO sighting, and</span>
    <span class="sd">sometimes is broken by tab characters which create extra columns.</span>
    <span class="sd">For these records, we only keep the first six columns. This typically</span>
    <span class="sd">cuts off some of the long description.</span>
    <span class="sd">Sometimes a line has fewer than six columns. These are not written to</span>
    <span class="sd">the output file (i.e., they're dropped from the data). These records</span>
    <span class="sd">are usually so compromised as to be uncleanable anyway.</span>
    <span class="sd">This function has (is) a side effect on the outpath file, to which it</span>
    <span class="sd">writes output.</span>
    <span class="sd">'''</span>
    <span class="n">inf</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="n">inpath</span><span class="p">,</span> <span class="s">'r'</span><span class="p">)</span>
    <span class="n">outf</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="n">outpath</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">inf</span><span class="p">:</span>
        <span class="n">splitline</span> <span class="o">=</span> <span class="n">line</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s">'</span><span class="se">\t</span><span class="s">'</span><span class="p">)</span>
        <span class="c"># Skip short lines, which are dirty beyond repair, anyway.</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">splitline</span><span class="p">)</span> <span class="o">&lt;</span> <span class="mi">6</span><span class="p">:</span>
            <span class="k">continue</span>
        <span class="n">newline</span> <span class="o">=</span> <span class="p">(</span><span class="s">'</span><span class="se">\t</span><span class="s">'</span><span class="p">)</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">splitline</span><span class="p">[</span> <span class="p">:</span><span class="mi">6</span><span class="p">])</span>
        <span class="c"># Records that have been truncated won't end in a newline character</span>
        <span class="c"># so add one.</span>
        <span class="k">if</span> <span class="n">newline</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">:</span> <span class="p">]</span> <span class="o">!=</span> <span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">:</span>
            <span class="n">newline</span> <span class="o">+=</span> <span class="s">'</span><span class="se">\n</span><span class="s">'</span>
        <span class="n">outf</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">newline</span><span class="p">)</span>
    <span class="n">inf</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
    <span class="n">outf</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
</pre></div>
<p>This function performs the following steps:</p>
<ol>
<li>Open the input file for reading and the output file for writing.</li>
<li>Read a line from the original file.</li>
<li>Split the line into columns at the tab characters using the
<code>split()</code> method.</li>
<li>If the line is split into fewer than six columns, ignore this line and go
read the next one.</li>
<li>Otherwise, rejoin the first six columns of the split line
with tab characters using the <code>join()</code> method. This results
in <code>newline</code>.</li>
<li>If there’s not a line break character at the end of <code>newline</code> (which
will happen if we’ve cut off the ending column because it was past
the sixth column), then add one on.</li>
<li>Write <code>newline</code> to the output file.</li>
<li>Repeat 2-7 with the next line of the input file.</li>
</ol>
<p>Note that step 4 means that short lines with fewer than six columns (fewer
than five tabs) don’t get written to the cleaned file. I haven’t investigated in
depth why some rows are too short and whether there’s a way to fix those
rows instead of tossing them out, but it’s unlikely the fix would be
simple or reliable.</p>
<p>I run the function to create a cleaned-up tab-separated file called
<code>ufo_awesome_6col.tsv</code>. (The path to the input file, <code>inpath</code>, was
already defined.)</p>
<div class="highlight"><pre><span class="n">outpath</span> <span class="o">=</span> <span class="s">'data/ufo/ufo_awesome_6col.tsv'</span>
<span class="n">ufotab_to_sixcols</span><span class="p">(</span><span class="n">inpath</span><span class="p">,</span> <span class="n">outpath</span><span class="p">)</span>
</pre></div>
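<p>Before handing the cleaned file back to Pandas, a quick sanity check can
confirm that every line now has exactly six columns. The sketch below is
written for Python 3 and uses a hypothetical helper, <code>count_bad_lines</code>,
demonstrated on a throwaway file of made-up records rather than on the real data.</p>

```python
import os
import tempfile


def count_bad_lines(path, n_cols=6, sep='\t'):
    """Return how many lines in path don't split into exactly n_cols fields."""
    bad = 0
    with open(path, 'r') as f:
        for line in f:
            if len(line.split(sep)) != n_cols:
                bad += 1
    return bad


# Throwaway file: one good six-column line and one seven-column line.
with tempfile.NamedTemporaryFile('w', suffix='.tsv', delete=False) as tmp:
    tmp.write("a\tb\tc\td\te\tf\n")
    tmp.write("a\tb\tc\td\te\tf\tg\n")
    tmp_path = tmp.name

n_bad = count_bad_lines(tmp_path)
os.remove(tmp_path)
print(n_bad)
```

<p>Pointed at <code>data/ufo/ufo_awesome_6col.tsv</code>, a result of zero would
indicate that the cleaning step did what it was supposed to.</p>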
<h2>Trying <code>read_table()</code> again.</h2>
<p>Now I’ll try using Pandas and <code>read_table()</code> again to load the file into
a data frame. (Since I know what the column names are supposed to be,
I’ll just pass them to the function instead of adding them later.)</p>
<div class="highlight"><pre><span class="n">ufo</span> <span class="o">=</span> <span class="n">read_table</span><span class="p">(</span><span class="s">'data/ufo/ufo_awesome_6col.tsv'</span><span class="p">,</span> <span class="n">sep</span> <span class="o">=</span> <span class="s">'</span><span class="se">\\</span><span class="s">t'</span><span class="p">,</span>
<span class="n">na_values</span> <span class="o">=</span> <span class="s">''</span><span class="p">,</span> <span class="n">header</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
<span class="n">names</span> <span class="o">=</span> <span class="p">[</span><span class="s">'date_occurred'</span><span class="p">,</span> <span class="s">'date_reported'</span><span class="p">,</span>
<span class="s">'location'</span><span class="p">,</span> <span class="s">'short_desc'</span><span class="p">,</span> <span class="s">'duration'</span><span class="p">,</span>
<span class="s">'long_desc'</span><span class="p">])</span>
</pre></div>
<p>And this now runs without a hitch. We’ll use the <code>head()</code> and
<code>to_string()</code> methods of a Pandas data frame to compare the first six
rows of the data to what’s shown in the table on p. 14 of <span class="caps">MLFH</span>.</p>
<div class="highlight"><pre><span class="n">ufo</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">6</span><span class="p">)</span><span class="o">.</span><span class="n">to_string</span><span class="p">(</span><span class="n">formatters</span> <span class="o">=</span>
<span class="p">{</span><span class="s">'long_desc'</span> <span class="p">:</span> <span class="k">lambda</span> <span class="n">x</span> <span class="p">:</span> <span class="n">x</span><span class="p">[</span> <span class="p">:</span><span class="mi">21</span><span class="p">]})</span>
</pre></div>
<p>The dictionary in the <code>formatters</code> argument tells <code>to_string()</code> to only
print the first 21 characters in the long description. The result is the
following table:</p>
<div class="highlight"><pre>   date_occurred  date_reported  location              short_desc  duration  long_desc
0  19951009       19951009       Iowa City, IA         NaN         NaN       Man repts. witnessing
1  19951010       19951011       Milwaukee, WI         NaN         2 min.    Man on Hwy 43 SW of
2  19950101       19950103       Shelton, WA           NaN         NaN       Telephoned Report:CA
3  19950510       19950510       Columbia, MO          NaN         2 min.    Man repts. son&apos;s
4  19950611       19950614       Seattle, WA           NaN         NaN       Anonymous caller rept
5  19951025       19951024       Brunswick County, ND  NaN         30 min.   Sheriff&apos;s office
</pre></div>
<p>And this matches the authors’ table on p. 14. So we’re off to a good
start. In the next post we’ll clean this data up some more and do some
munging to get at the information we’re interested in.</p>