# Machine Learning for Hackers Chapter 2, Part 1: Summary stats and density estimators

Chapter 2 of MLFH summarizes techniques for exploring your data: determining data types, computing quantiles and other summary statistics, and plotting simple exploratory graphics. I’m not going to replicate it in its entirety; I’m just going to hit some of the more involved or interesting parts. The IPython notebook I created for this chapter, which lives here, contains more code than I’ll present on the blog.

This part’s highlights:

1. Pandas objects, as we’ve seen before, have methods that provide simple summary statistics.
2. The plotting methods in Pandas let you pass parameters to the Matplotlib functions they call. I’ll use this feature to mess around with histogram bins.
3. The `gaussian_kde` (kernel density estimator) function in `scipy.stats.kde` provides density estimates similar to R’s `density` function for Gaussian kernels. The `kdensity` function, in `statsmodels.nonparametric.kde` provides that and other kernels, but given the state of `statsmodels` documentation, you would probably only find this function by accident. It’s also substantially slower than `gaussian_kde` on large data. *[Not quite so! See update at the end.]

## Height and weight data

The data analyzed in this chapter are the sexes, heights and weights, of 10,000 people. The raw file is a CSV that I import using `read_table` in Pandas:

```heights_weights =
```

Inspecting the data with `head`,

```print heights_weights.head(10)
```

gives us:

``` Gender    Height     Weight
0  Male 73.847017 241.893563
1  Male 68.781904 162.310473
2  Male 74.110105 212.740856
3  Male 71.730978 220.042470
4  Male 69.881796 206.349801
5  Male 67.253016 152.212156
6  Male 68.785081 183.927889
7  Male 68.348516 167.971110
8  Male 67.018950 175.929440
9  Male 63.456494 156.399676
```

So it looks like heights are in inches, and weights are in pounds. It also looks like the dataset is evenly split between men and women, since

```heights_weights.groupby('Gender')['Gender'].count()
```

results in:

```Gender
Female 5000
Male 5000
```

The data are simple, clean, and appear to have imported correctly. So, we can start looking at some simple summaries.

## Numeric summaries, especially quantiles

The first part of Chapter 2 covers the basic summary statistics: means, medians, variances, and quantiles. The authors hand-roll the mean, median, and variance functions to see how each is calculated. All of these methods are available as methods to Pandas series, or as NumPy functions (which are typically what’s called by equivalent Pandas methods).

The `describe` method of Pandas series and data frames, which we saw in Part 3 of Chapter 1, gives summary statistics. The summary stats for the height variable are:

```heights = heights_weights['Height']
heights.describe()

count 10000.000000
mean 66.367560
std 3.847528
min 54.263133
25% 63.505620
50% 66.318070
75% 69.174262
max 78.998742
```

The heights all lay within a reasonable range, with no apparent outliers from bad data. The default quantile range in `describe` is 50%, so we get the 75th and 25th percentiles. This can be changed with the `percentile_width` argument; for example, `percentile_width = 90` would give the 95th and 5th percentiles.

There doesn’t seem to be a direct analog to R’s `range` function, which calculates the difference between the maximum and minimum value of a vector, nor for the `quantile`, which can calculate the quantiles at any given a series of probabilities. These are easy enough to replicate though.

Note: Nathaniel Smith, in comments, points out that R’s `range` function doesn’t do this either, but just returns the min and max of a vector. There is a function for this in NumPy, though: the `my_range` function below gives the same result as would `np.ptp(heights.values)`. `ptp` is the “peak-to-peak” (min-to-max) function.

Range is trivial:

```def my_range(s):
'''
Difference between the max and min of an array or Series
'''
return s.max() - s.min()
```

Calling this, we get a range of 78.99 − 54.26 = 24.63 inches.

Next, a `quantiles` function to mimic R’s. We can just make a wrapper around the `quantile` method, mapping it along a sequence of provided probabilities.

```def my_quantiles(s, prob = (0.0, 0.25, 0.5, 1.0)):
'''
Calculate quantiles of a series.

Parameters:
-----------
s : a pandas Series
prob : a tuple (or other iterable) of probabilities at
which to compute quantiles. Must be an iterable,
even for a single probability (e.g. prob = (0.50,)
not prob = 0.50).

Returns:
--------
A pandas series with the probabilities as an index.
'''
q = [s.quantile(p) for p in prob]
return Series(q, index = prob)
```

Note that the default argument gives quartiles. We can get deciles by calling:

```print my_quantiles(heights, prob = arange(0, 1.1, 0.1))
```

which spits out:

```0.0 54.263133
0.1 61.412701
0.2 62.859007
0.3 64.072407
0.4 65.194221
0.5 66.318070
0.6 67.435374
0.7 68.558072
0.8 69.811620
0.9 71.472149
1.0 78.998742
```

Note: the `quantiles` function I’ve written is a little awkward when dealing with a single quantile. Because the list comprehension that computes the qunatiles requires that the `prob` argument be an iterable, you would have to pass a list, tuple, array or other iterable with a single value. You can’t just pass it a float. I’ve hit this issue a few times writing Python functions–where it’s difficult to make code robust to both iterable and singleton arguments. If anyone has tips on this (should I really be doing type checking?), I’d be thrilled to hear them.

## Histograms

Next the authors mess around with histograms and density plots to explore the distribution of the data. Noting that different bin sizes for histograms can affect how we perceive the data’s distribution, they plot histograms for a few different bin widths.

In Matplotlib, bins are not specified by their width, as is possible ggplot. We can either give Matplotlib the number of bins we want it to plot, or specify the actual bin-edge locations. It’s not difficult to translate a desired bin width into either one of these types of argument. I’ll provide the sequence of bins.

First, 1-inch bins:

```bins1 = np.arange(heights.min(), heights.max(), 1.0)
heights.hist(bins = bins1, fc = 'steelblue')
``` Note how I’m using the Pandas `hist` method, which, using a `**kwargs` argument, can pass parameters to the Matplotlib plotting functions. Next, 5-inch bins:

```bins5 = np.arange(heights.min(), heights.max(), 5.)
heights.hist(bins = bins5, fc = 'steelblue')
``` And finally, 0.001-inch bins:

```bins001 = np.arange(heights.min(), heights.max(), .001)
heights.hist(bins = bins001, fc = 'steelblue')
plt.savefig('height_hist_bins001.png')
``` These all match the figures in the book, so I’m probably doing it right.

## Kernel density estimators in SciPy and statsmodels

R’s `density` function computes kernel density estimates. The default kernel is Gaussian, but you can also use Epanechnikov, rectangular, triangular, biweight, cosine kernels.

In Python, it looks like you have two options for kernel density. The first is `gaussian_kde` from the `scipy.stats.kde` module. This provides a Gaussian kernel density estimate only. The other is `kdensity` in the `statsmodels.nonparametric.kde` module, which provides alternative kernels similar to R.

I actually wasn’t aware of the `kdensity` function for a while, until I stumbled upon a mention of it on a mailing list archive. I couldn’t find it in the statsmodels documentation. Statsmodels, generally, seems to have a lot of undocumented functionality; not surprising for a young, rapidly-expanding project.

Playing with both functions, I found some pros and cons for each. Obviously `kdensity` provides an option of kernels, whereas `gaussian_kde` does not. `kdensity` also generates simpler output than `gaussian_kde`. `kdensity` provides a tuple of two arrays–the grid of points at which the density was estimated, and the estimated density of those points. `gaussian_kde` provides an object that you have to evaluate on a set of points to get an array of estimated densities. So essentially, you’re calling it twice, and I don’t see much point to that redundancy.

On the other hand `kdensity` gets much slower than `gaussian_kde` as the number of points increases. For the 10,000 points in the = `heights` array, `gaussian_kde` took about 3.3 seconds to output the array of estimated densities. `kdensity` wasn’t finished after several minutes. I haven’t looked carefully at the source code of the two functions, but I assume `kdensity`s problem is that at some point it creates a temporary `NxN` array, which for `N = 10,000` is going to gum things up. Setting the `gridsize` argument in `kdensity` to something even as large as `5000`, cuts the size of the temporary array in half, and reduces the running time to about 3 seconds.

This is probably worth exploring in a future post. In the meantime, I’m going stick with `gaussian_kde` and plot some densities. Note: See the update below. I’ve updated the IPython notebook for this chapter to use Statsmodels’ KDE class instead of SciPy.]

First, heights:

```density = kde.gaussian_kde(heights.values)

fig = plt.figure()
plt.plot(np.sort(heights.values),
density(np.sort(heights.values)))
``` The sorting of the `heights` array is to make the lines connect nicely. Otherwise, the lines will connect from point-to-point in the order they occur in the array; we want the density curve to connect points left-to-right.

Notice the slight bi-modality in the figure. What we’re likely seeing is a mixture of male and female distributions. We can plot those separately.

```# Pull out male and female heights as arrays over which to compute densities
heights_m = heights[heights_weights['Gender'] == 'Male'].values
heights_f = heights[heights_weights['Gender'] == 'Female'].values
density_m = kde.gaussian_kde(heights_m)
density_f = kde.gaussian_kde(heights_f)

fig = plt.figure()
plt.plot(np.sort(heights_m), density_m(np.sort(heights_m)), label = 'Male')
plt.plot(np.sort(heights_f), density_f(np.sort(heights_f)), label = 'Female')
plt.legend()
``` We also have a weight variable we can plot.

```weights_m = heights_weights[heights_weights['Gender'] == 'Male']['Weight'].values
weights_f = heights_weights[heights_weights['Gender'] == 'Female']['Weight'].values
density_m = kde.gaussian_kde(weights_m)
density_f = kde.gaussian_kde(weights_f)

fig = plt.figure()
plt.plot(np.sort(weights_m), density_m(np.sort(weights_m)), label = 'Male')
plt.plot(np.sort(weights_f), density_f(np.sort(weights_f)), label = 'Female')
plt.legend()
``` To finish up, let’s move each density plot to its own subplot, to match Figure 2-11 on page 51.

```fig, axes = plt.subplots(nrows = 2, ncols = 1, sharex = True, figsize = (9, 6))
axes.plot(np.sort(weights_f), density_f(np.sort(weights_f)),
label = 'Female')
axes.xaxis.tick_top()
axes.legend()
axes.plot(np.sort(weights_m), density_m(np.sort(weights_m)),
label = 'Male')
axes.legend()
``` Here I’m using the `subplots` function, same as in Part 5 of Chapter 1, and sharing the x-axis to make clear the difference between the distributions’ central tendencies.

## Conclusion

I’ll wrap up Chapter 2 in the next post, where I’ll look at lowess smoothing in Statsmodels, and get a little taste of logistic regression.

## Update!

Statsmodels honcho skipper seabold sets me straight in the comments. While the `kdensity` function is slow, statsmodels has an implementation which uses Fast Fourier Transforms for Gaussian kernels and is substantially faster than Scipy’s `gaussian_kde`.

For the heights array:

```# Create a KDE object
heights_kde = sm.nonparametric.kde.KDE(heights.values)

# Estimate the density by fitting the object (default Gaussian kernel via FFT)
heights_kde.fit()
```

We can then plot this vector of estimated densities, `heights_kde.density` against the points in `heights_kde.support`.

I’ve updated the IPython notebook for this chapter to use Statsmodels’ KDE throughout, so check it out for more detail.