Machine Learning for Hackers, Chapter 1, Part 4: Data aggregation and reshaping.

Introduction

In the last part I made some simple summaries of the cleaned UFO data: basic descriptive statistics and histograms. At the very end, I did some simple data aggregation by summing up the sightings by date, and plotted the resulting time series. In this part, I’ll go further with the aggregation, totalling sightings by state and month.

The takeaway from this part is that Pandas dataframes have some powerful methods for aggregating and manipulating data. I’ll show `groupby`, `reindex`, hierarchical indices, and `stack` and `unstack` in action.
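As a rough preview of how these methods fit together, here is a minimal sketch on made-up data. The column names `state` and `month` and the toy values are assumptions for illustration, not the real UFO data or the code used later in this post.

```python
import pandas as pd

# Hypothetical toy data: one row per sighting (not the real UFO file).
sightings = pd.DataFrame({
    "state": ["CA", "CA", "TX", "CA"],
    "month": ["1990-01", "1990-01", "1990-01", "1990-03"],
})

# groupby + size counts sightings per (state, month) pair. The result is a
# Series with a hierarchical (two-level) index: the "long" shape.
long_counts = sightings.groupby(["state", "month"]).size()

# unstack moves the "state" level into the columns, giving the "wide" shape.
# Combinations with no sightings are filled with 0 instead of NaN.
wide_counts = long_counts.unstack("state", fill_value=0)

# reindex adds rows for months that had no sightings at all.
all_months = ["1990-01", "1990-02", "1990-03"]
wide_counts = wide_counts.reindex(all_months, fill_value=0)

# stack reverses unstack: back to one entry per (month, state) pair.
back_to_long = wide_counts.stack()

print(wide_counts)
```

The wide table has one row per month and one column per state; stacking it again gives one entry for every month/state pair, including the zero counts that were missing from the original long version.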

The shape of data: the long and the wide of it

The first step in aggregating and reshaping data is to figure out the final form you want the data to be in. This form is basically defined by content and shape.

We know what we want the content to be: an entry in the data should give the number of sightings in a state/month combination.

We have two choices for the shape: wide or long. The wide version of this data would have months as the rows and states as the columns; it would be a 248 by 51 table with the number of sightings as entries. This ...

Machine Learning for Hackers Chapter 1, Part 3: Simple summaries and plots.

Introduction

See Part 1 and Part 2 for previous work.

In this part, I’ll replicate the authors’ exploration of the UFO sighting dates via histograms. The key takeaways:

1. The plotting methods in Pandas are easy and useful.
2. Unlike R `Dates`, Python `datetimes` aren’t compatible with a lot of mathematical operations. We’ll see that you can’t apply quantile or histogram methods to them directly.
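To make the second point concrete, here is a sketch of one common workaround: convert the dates to integers, summarize those, and convert back. The sample dates are made up, and this is not necessarily the approach taken later in this post; newer Pandas versions may also handle datetimes more directly.

```python
from datetime import date

import numpy as np

# Hypothetical sample of sighting dates (not taken from the UFO data).
dates = [
    date(1990, 1, 1),
    date(1990, 1, 11),
    date(1990, 1, 21),
    date(1990, 1, 31),
    date(1990, 2, 10),
]

# Convert each date to its ordinal: an integer count of days.
ordinals = np.array([d.toordinal() for d in dates])

# Quantiles and histograms are well defined on plain integers.
median_ordinal = int(np.percentile(ordinals, 50))
counts, bin_edges = np.histogram(ordinals, bins=4)

# Convert the summary value back to a date for display.
median_date = date.fromordinal(median_ordinal)
print(median_date)
```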

Quick data summary methods and datetime complications.

For those playing along at home, I’m at p. 19 of the book. The first thing the authors do here is get a statistical summary of the sighting dates in the data, which are recorded in the `DateOccurred` variable (which I’ve named `date_occurred` in my code). This is easy in R using the `summary` function, which provides the minimum, maximum, and quartiles of the data by default.

Pandas has similar functionality, in a method called `describe`, which gives the same for numeric variables, plus the count of non-null values and the mean and standard deviation. For example:

```
import numpy as np
from pandas import Series
s1 = Series(np.random.randn(100))
print(s1.describe())
```

outputs what we’d expect from a series of randomly-generated standard normals:

```
count 100.000000
mean -0.149274 ...
```

Machine Learning for Hackers Chapter 1, Part 2: Cleaning date and location data

Introduction

In the previous post, I loaded the raw UFO data into a Pandas data frame after cleaning up some irregularities in the text file. Since we’re ultimately concerned with analyzing UFO sightings over time and space, the next step is to clean those variables and prepare them for analysis and visualization.

Some Python techniques to note in this part are:

• Like in the last part, Python string methods are going to come in really handy, and be a simple, expressive solution to a lot of problems.
• When those aren’t enough, Python has a pretty straightforward set of functions for implementing regular expressions.
• The `map()` method in Pandas can be used to “vectorize” functions along a Series (i.e. a data frame column) and is similar to R’s `apply`. In general, using a NumPy `ufunc` (vectorized function) is preferable, but not all operations can be expressed in `ufunc`s. This is especially true for non-numeric operations, such as for strings or dates.
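Here is a small sketch of how these three techniques can combine: a regular expression and string methods wrapped in a function, then applied along a Series with `map()`. The sample locations, the pattern, and the `split_location` helper are all hypothetical, chosen only for illustration.

```python
import re

import pandas as pd

# Hypothetical location strings; the last one has no state abbreviation.
locations = pd.Series([" Iowa City, IA", " Milwaukee, WI", " Bermuda Triangle"])

# Matches "City, ST", where ST is a two-letter abbreviation.
PATTERN = re.compile(r"^(?P<city>.+),\s*(?P<state>[A-Za-z]{2})$")


def split_location(text):
    """Return (city, state) for 'City, ST' strings, else (None, None)."""
    match = PATTERN.match(text.strip())
    if match is None:
        return (None, None)
    return (match.group("city").strip(), match.group("state").upper())


# map() calls the function once per element of the Series.
parsed = locations.map(split_location)
states = parsed.map(lambda pair: pair[1])
print(states)
```

Because `split_location` is ordinary Python, `map()` calls it element by element. That is slower than a `ufunc`, but it works for string logic that has no vectorized equivalent.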

Cleaning dates: mapping and subsetting.

The first two columns of the data are dates in `YYYYMMDD` format, and Pandas imported them as integers. R has a function, `as.Date`, that will operate on a vector ...

Preface

This is my first Will it Python? post. These posts document my experiences trying to port complete and interesting R projects to Python. I’m beginning by going through the recently published Machine Learning for Hackers (MLFH) by Drew Conway and John Myles White.