Machine Learning for Hackers Chapter 1, Part 3: Simple summaries and plots.

Introduction

See Part 1 and Part 2 for previous work.

In this part, I’ll replicate the authors’ exploration of the UFO sighting dates via histograms. The key takeaways:

  1. The plotting methods in Pandas are easy and useful.
  2. Unlike R Dates, Python datetimes aren’t compatible with a lot of mathematical operations. We’ll see that you can’t apply quantile or histogram methods to them directly.

Quick data summary methods and datetime complications.

For those playing along at home, I’m at p. 19 of the book. The first thing the authors do here is get a statistical summary of the sighting dates in the data, which are recorded in the DateOccurred variable (which I’ve named date_occurred in my code). This is easy in R using the summary function, which provides the minimum, maximum, and quartiles of the data by default.

Pandas has similar functionality, in a method called describe, which gives the same for numeric variables, plus the count of non-null values and the mean and standard deviation. For example:

s1 = Series(np.random.randn(100))
print s1.describe()

outputs what we’d expect from a series of randomly-generated standard normals:

count 100.000000
mean -0.149274
std 1.011230
min -2.521374
25% -0.790867
50% -0.167813
75% 0.596617
max 2.231157

If we apply this to the date_occurred series, though, we get something different.

ufo_us['date_occurred'].describe()[/sourcecode]

results in:

count 52134
unique 8786
top 1999-11-16 00:00:00
freq 185

because Pandas treats datetime series as non-numeric variables (which they technically are).

Note: To compute quantiles for numeric series, Pandas uses SciPy’s scoreatpercentile function, which in turn relies on a simple linear interpolation function (_interpolate in scipy.stats). datetime objects don’t play well with this function, since when you take the difference between two datetimes you don’t get a number, but instead a timedelta tuple, that you can’t perform mathematical operations on until you unpack it. The min and max methods will work on datetimes, though.

We can get around this by extracting the years from the variable, which will be integers.

years = ufo_us['date_occurred'].map(lambda x: x.year)
print years.describe()

results in:

count 52134.000000
mean 2000.572237
std 10.889045
min 1400.000000
25% 1999.000000
50% 2003.000000
75% 2007.000000
max 2010.000000

which is a little precise for year data, but how is Pandas to know? At any rate, we come to the same conclusion as the authors: that three quarters of the sightings occurred in 1999 or later, and the earliest date in the data is in 1400. (If we check, we’ll see this sighting occurred in Texas, so it’s certainly an error).

Plotting histograms

The authors then plot a histogram of the dates in the data. Like with quantile, the hist plot method (which just calls a Matplotlib histogram) doesn’t work with datetime data. If we try

ufo_us['date_occurred'].hist()

we’ll get an error complaining that datetime can’t be compared with float. So, I’ll just work with the years instead of the full datetime. I can generate the plot with a call to the series’ hist method, one of several plotting methods for Pandas objects that makes it extremely easy to get quick plots of them.

plt.figure()
years.hist(bins = (years.max() - years.min())/30., fc = 'steelblue')
plt.title('Histogram of years with U.S. UFO sightings\nAll years in data')
plt.savefig('quick_hist_all_years.png')

I explicitly set the bins to match the ggplot defaults used in the book. We get this plot, which basically matches the authors’:

The authors then focus on only data after 1990, using R’s subset function to remove earlier observations from the data. This is straightforward in Pandas. I’ll also extract another series with the years of this subset of dates.

ufo_us = ufo_us[ufo_us['date_occurred'] \>= dt.datetime(1990, 1, 1)]
years_post90 = ufo_us['date_occurred'].map(lambda x: x.year)

After subsetting, the authors have 46,347 rows left in the data. Looking at the shape attribute of the subsetted data frame, we have 46,780. We’ve picked up some observations from D.C., as well as from our more expansive method of finding U.S. locations.

Another histogram of the subset data looks similar to the authors’ chart on p. 23, but since I’m only histogramming over years, I lose some resolution.

While the histogram is fine for a quick look at the distribution of dates, it’s not a very accurate picture of how sightings evolve over time: the binning really destroys too much information. It makes more sense just to do a time-series plot of total sightings by date. We can do that with some data aggregation and an easy call to the plot method in Pandas.

post90_count = ufo_us.groupby('date_occurred')['date_occurred'].count()
plt.figure()
post90_count.plot()
plt.title('Number of U.S. UFO sightings\\nJanuary 1990 through August 2010')
plt.savefig('post90_count_ts.png')

This uses Pandas’ awesome groupby method, which I’ll discuss more in the next part. We get the following figure:

Based on this graph, it looks like there’s a seasonal component to sightings, which wasn’t apparent in the histogram. There are also a few large spikes, especially around the end of the millenium.

Conclusion

This part was a relatively easy one. The next part will focus on data aggregation using groupby and reindex methods. Then I’ll wrap up with with replicating the authors’ trellis graph.

Comments