Machine Learning for Hackers, Chapter 1, Part 4: Data aggregation and reshaping.
Introduction
In the last part I made some simple summaries of the cleaned UFO data: basic descriptive statistics and histograms. At the very end, I did some simple data aggregation by summing up the sightings by date, and plotted the resulting time series. In this part, I’ll go further with the aggregation, totalling sightings by state and month.
The takeaway from this part is that Pandas dataframes have some powerful methods for aggregating and manipulating data. I’ll show groupby, reindex, hierarchical indices, and stack and unstack in action.
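As a quick preview of how these methods fit together, here is a minimal sketch on made-up data. The frame, its column names, and the month labels are all hypothetical stand-ins, not the real UFO dataset.

```python
import pandas as pd

# Hypothetical toy data: one row per sighting, labelled by state and month.
toy = pd.DataFrame({
    "state": ["CA", "CA", "CA", "WA"],
    "month": ["1990-01", "1990-01", "1990-03", "1990-03"],
})

# groupby: count rows per (month, state) pair.
# The result is a Series with a hierarchical (two-level) index.
counts = toy.groupby(["month", "state"]).size()

# unstack: pivot the "state" index level into columns, filling absent pairs with 0.
wide = counts.unstack("state", fill_value=0)

# reindex: add a row for the month that has no sightings in any state.
all_months = ["1990-01", "1990-02", "1990-03"]
wide = wide.reindex(all_months, fill_value=0)

# stack: pivot the columns back into an index level, returning to one row per pair.
long = wide.stack()

print(wide)
print(long)
```

Each step feeds the next: groupby produces the hierarchical index, unstack and stack move one of its levels between rows and columns, and reindex fills in labels that never appeared in the data.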
The shape of data: the long and the wide of it
The first step in aggregating and reshaping data is to figure out the final form you want the data to be in. This form is basically defined by content and shape.
We know what we want the content to be: an entry in the data should give the number of sightings in a state/month combination.
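To make that concrete, here is a small sketch of what one such entry looks like when computed from raw records. The frame and the column names date_occurred and us_state are assumptions for illustration and may not match the names in the cleaned data.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned UFO data; column names are assumed.
ufo = pd.DataFrame({
    "date_occurred": pd.to_datetime(
        ["1995-06-03", "1995-06-21", "1995-07-04", "1995-06-15"]
    ),
    "us_state": ["tx", "tx", "tx", "ny"],
})

# Collapse each date to its calendar month.
ufo["month"] = ufo["date_occurred"].dt.to_period("M")

# One entry per state/month combination: the number of sightings in it.
sightings = ufo.groupby(["us_state", "month"]).size()

print(sightings)
```

Note that a combination with no sightings does not appear in the result at all, so it has to be filled in separately if a zero is wanted there.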
We have two choices for the shape: wide or long. The wide version of this data would have months as the rows and states as the columns; it would be a 248 by 51 table with the number of sightings as entries. This ...