# *Machine Learning for Hackers* Chapter 1, Part 3: Simple summaries and plots.

## Introduction

See Part 1 and Part 2 for previous work.

In this part, I’ll replicate the authors’ exploration of the UFO sighting dates via histograms. The key takeaways:

- The plotting methods in Pandas are easy and useful.
- Unlike R `Dates`, Python `datetimes` aren’t compatible with a lot of mathematical operations. We’ll see that you can’t apply quantile or histogram methods to them directly.

## Quick data summary methods and datetime complications.

For those playing along at home, I’m at p. 19 of the book. The first
thing the authors do here is get a statistical summary of the sighting
dates in the data, which are recorded in the `DateOccurred` variable
(which I’ve named `date_occurred` in my code). This is easy in R using
the `summary` function, which provides the minimum, maximum, and
quartiles of the data by default.

Pandas has similar functionality, in a method called `describe`, which
gives the same for numeric variables, plus the count of non-null values
and the mean and standard deviation. For example:

```
import numpy as np
from pandas import Series
s1 = Series(np.random.randn(100))
print(s1.describe())
```

outputs what we’d expect from a series of randomly-generated standard normals:

```
count 100.000000
mean -0.149274
std 1.011230
min -2.521374
25% -0.790867
50% -0.167813
75% 0.596617
max 2.231157
```

If we apply this to the `date_occurred` series, though, we get something different.

```
ufo_us['date_occurred'].describe()
```

results in:

```
count 52134
unique 8786
top 1999-11-16 00:00:00
freq 185
```

because Pandas treats `datetime` series as non-numeric variables (which
they technically are).

Note: To compute quantiles for numeric series, Pandas uses SciPy’s `scoreatpercentile` function, which in turn relies on a simple linear interpolation function (`_interpolate` in `scipy.stats`). `datetime` objects don’t play well with this function: when you take the difference between two `datetimes`, you don’t get a number but a `timedelta` object, which can’t be treated as a plain number until you convert it (for example, to a count of days). The `min` and `max` methods will work on `datetimes`, though.
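To make the note concrete, here is a minimal sketch in plain Python. The dates are made up for illustration (they are not drawn from the UFO data), and `quantile` is a hypothetical helper that mimics simple linear interpolation; it is not the SciPy function itself. The idea is that once each `datetime` is converted to an integer day count, interpolation works as it does for any numeric series.

```python
import datetime as dt

# Hypothetical sample of sighting dates (illustrative only).
dates = [dt.datetime(1995, 6, 1), dt.datetime(1999, 11, 16),
         dt.datetime(2003, 7, 4), dt.datetime(2007, 1, 15),
         dt.datetime(2010, 8, 30)]

# Subtracting two datetimes yields a timedelta, not a plain number.
gap = dates[1] - dates[0]
print(type(gap).__name__, gap.days)

def quantile(sorted_vals, q):
    """Linearly interpolated quantile of an already-sorted numeric list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

# Convert each datetime to an integer day count so interpolation works,
# then convert the result back to a date.
ordinals = sorted(d.toordinal() for d in dates)
median_ordinal = quantile(ordinals, 0.5)
median_date = dt.date.fromordinal(int(round(median_ordinal)))
print(median_date)
```

Converting to ordinals keeps day-level resolution, at the cost of an extra conversion step in each direction.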

We can get around this by extracting the years from the variable, which will be integers.

```
years = ufo_us['date_occurred'].map(lambda x: x.year)
print(years.describe())
```

results in:

```
count 52134.000000
mean 2000.572237
std 10.889045
min 1400.000000
25% 1999.000000
50% 2003.000000
75% 2007.000000
max 2010.000000
```

which is a little precise for year data, but how is Pandas to know? At any rate, we come to the same conclusion as the authors: that three quarters of the sightings occurred in 1999 or later, and the earliest date in the data is in 1400. (If we check, we’ll see this sighting occurred in Texas, so it’s certainly an error.)

## Plotting histograms

The authors then plot a histogram of the dates in the data. Like with
`quantile`, the `hist` plot method (which just calls a Matplotlib
histogram) doesn’t work with `datetime` data. If we try

```
ufo_us['date_occurred'].hist()
```

we’ll get an error complaining that `datetime` can’t be compared with
`float`. So, I’ll just work with the years instead of the full
`datetime`. I can generate the plot with a call to the series’ `hist`
method, one of several plotting methods for Pandas objects that make it
extremely easy to get quick plots of them.

```
plt.figure()
# bins must be an integer count, so cast the float result of the division
years.hist(bins=int((years.max() - years.min()) / 30.), fc='steelblue')
plt.title('Histogram of years with U.S. UFO sightings\nAll years in data')
plt.savefig('quick_hist_all_years.png')
```

I explicitly set the bins to match the ggplot defaults used in the book. We get this plot (saved as `quick_hist_all_years.png`), which basically matches the authors’.
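For reference, ggplot2's default histogram uses 30 bins, so each bin spans the data range divided by 30. The sketch below works through that arithmetic in plain Python, assuming the minimum and maximum years from the summary above (1400 and 2010). The names `n_bins`, `bin_width`, and `edges` are illustrative and do not appear in the original code.

```python
# Year range assumed from the describe() output shown earlier.
year_min, year_max = 1400, 2010

n_bins = 30                                  # ggplot2's default bin count
bin_width = (year_max - year_min) / n_bins   # width of each bin, in years

# Bin edges: n_bins + 1 evenly spaced points covering the full range.
edges = [year_min + i * bin_width for i in range(n_bins + 1)]

print(round(bin_width, 2))
print(len(edges))
```

Matplotlib's `hist` accepts either an integer bin count or a sequence of bin edges for `bins`, so passing `edges` (or simply `bins=30`) is one way to pin the binning down explicitly.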

The authors then focus on only data after 1990, using R’s `subset`
function to remove earlier observations from the data. This is
straightforward in Pandas. I’ll also extract another series with the
years of this subset of dates.

```
ufo_us = ufo_us[ufo_us['date_occurred'] >= dt.datetime(1990, 1, 1)]
years_post90 = ufo_us['date_occurred'].map(lambda x: x.year)
```

After subsetting, the authors have 46,347 rows left in the data. Looking
at the `shape` attribute of the subsetted data frame, we have 46,780.
We’ve picked up some observations from D.C., as well as from our more
expansive method of finding U.S. locations.
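The boolean-mask subsetting used here can be tried on a small, made-up frame. This is only a sketch: the `toy` data and the `state` column are hypothetical, and it uses the `.dt.year` accessor available in current Pandas versions rather than the `map` call in the original code.

```python
import datetime as dt
import pandas as pd

# Hypothetical toy data standing in for the UFO data frame.
toy = pd.DataFrame({
    'date_occurred': pd.to_datetime(
        ['1985-03-02', '1990-01-01', '1999-11-16', '2004-07-04']),
    'state': ['TX', 'CA', 'WA', 'NY'],
})

# The comparison produces a boolean Series; indexing with it keeps
# only the rows where the condition is True.
mask = toy['date_occurred'] >= dt.datetime(1990, 1, 1)
post90 = toy[mask]

# Extract the year of each remaining date.
years_post90 = post90['date_occurred'].dt.year

print(post90.shape)
```

Checking `shape` after the subset, as done above with the real data, is a quick way to confirm how many rows survived the filter.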

Another histogram of the subset data looks similar to the authors’ chart on p. 23, but since I’m only histogramming over years, I lose some resolution.

While the histogram is fine for a quick look at the distribution of
dates, it’s not a very accurate picture of how sightings evolve over
time: the binning really destroys too much information. It makes more
sense just to do a time-series plot of total sightings by date. We can
do that with some data aggregation and an easy call to the `plot` method
in Pandas.

```
post90_count = ufo_us.groupby('date_occurred')['date_occurred'].count()
plt.figure()
post90_count.plot()
plt.title('Number of U.S. UFO sightings\nJanuary 1990 through August 2010')
plt.savefig('post90_count_ts.png')
```

This uses Pandas’ awesome `groupby` method, which I’ll discuss more in
the next part. We get the following figure (saved as `post90_count_ts.png`).

Based on this graph, it looks like there’s a seasonal component to sightings, which wasn’t apparent in the histogram. There are also a few large spikes, especially around the end of the millennium.
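One way to probe the seasonal pattern, sketched here on made-up dates rather than the real data, is to count sightings by calendar month alongside the per-date count. The `toy` frame and the `month` key are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data: five sightings, four of them in July.
toy = pd.DataFrame({'date_occurred': pd.to_datetime([
    '1999-07-04', '1999-07-04', '1999-07-05', '1999-12-31', '2000-07-04'])})

# Per-date count, the same aggregation used for the time-series plot.
daily = toy.groupby('date_occurred')['date_occurred'].count()

# Per-month count, pooling all years, to expose any seasonal pattern.
month = toy['date_occurred'].dt.month.rename('month')
monthly = toy.groupby(month).size()

print(daily)
print(monthly)
```

Pooling across years trades away the trend over time in exchange for a clearer view of which months see the most sightings.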

## Conclusion

This part was a relatively easy one. The next part will focus on data
aggregation using the `groupby` and `reindex` methods. Then I’ll wrap up
by replicating the authors’ trellis graph.
