posts in the Will it Python category

Machine Learning for Hackers, Chapter 1, Part 4: Data aggregation and reshaping.

Introduction

In the last part I made some simple summaries of the cleaned UFO data: basic descriptive statistics and histograms. At the very end, I did some simple data aggregation by summing up the sightings by date, and plotted the resulting time series. In this part, I’ll go further with the aggregation, totalling sightings by state and month.

The takeaway from this part is that Pandas dataframes have some powerful methods for aggregating and manipulating data. I’ll show groupby, reindex, hierarchical indices, and stack and unstack in action.
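
As a rough sketch of how groupby, hierarchical indices, and stack/unstack fit together, here is a toy example. The data and the column names (us_state, year_month) are made up for illustration and are not taken from the post's code.

```python
import pandas as pd

# Toy stand-in for the cleaned UFO data; column names are hypothetical.
sightings = pd.DataFrame({
    "us_state": ["CA", "CA", "WA", "WA", "OR"],
    "year_month": ["1990-01", "1990-01", "1990-01", "1990-02", "1990-02"],
})

# Long form: one entry per observed (state, month) pair,
# stored in a Series with a hierarchical (two-level) index.
long_counts = sightings.groupby(["us_state", "year_month"]).size()

# Wide form: months as rows, states as columns.
# Pairs with no sightings are filled with 0 instead of NaN.
wide_counts = long_counts.unstack("us_state", fill_value=0)

# stack() pivots the columns back into the index, returning to long form,
# now with an explicit entry for every (month, state) combination.
round_trip = wide_counts.stack()

print(wide_counts)
```

The long form only has entries for combinations that actually occur, while the wide form makes the gaps explicit, which is why a fill value is needed when unstacking.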

The shape of data: the long and the wide of it

The first step in aggregating and reshaping data is to figure out the final form you want the data to be in. This form is basically defined by content and shape.

We know what we want the content to be: an entry in the data should give the number of sightings in a state/month combination.

We have two choices for the shape: wide or long. The wide version of this data would have months as the rows and states as the columns; it would be a 248 by 51 table with the number of sightings as entries. This ...

Machine Learning for Hackers Chapter 1, Part 3: Simple summaries and plots.

Introduction

See Part 1 and Part 2 for previous work.

In this part, I’ll replicate the authors’ exploration of the UFO sighting dates via histograms. The key takeaways:

  1. The plotting methods in Pandas are easy and useful.
  2. Unlike R Dates, Python datetimes aren’t compatible with a lot of mathematical operations. We’ll see that you can’t apply quantile or histogram methods to them directly.
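
One possible workaround for the second point, sketched here with made-up dates and not necessarily the approach the post takes, is to convert each date to an integer ordinal, summarize the integers, and convert the result back to a date.

```python
from datetime import date

import numpy as np

# Made-up sighting dates, purely for illustration.
dates = [date(1995, 1, 1), date(1995, 1, 3), date(1995, 1, 5)]

# Map each date to its proleptic Gregorian ordinal (a plain integer).
ordinals = np.array([d.toordinal() for d in dates])

# Numeric summaries work on the integers; convert the median back to a date.
median_date = date.fromordinal(int(np.median(ordinals)))

# Histogram bin counts can be computed on the ordinals in the same way.
counts, edges = np.histogram(ordinals, bins=2)

print(median_date, counts)
```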

Quick data summary methods and datetime complications.

For those playing along at home, I’m at p. 19 of the book. The first thing the authors do here is get a statistical summary of the sighting dates in the data, which are recorded in the DateOccurred variable (which I’ve named date_occurred in my code). This is easy in R using the summary function, which provides the minimum, maximum, and quartiles of the data by default.

Pandas has similar functionality, in a method called describe, which gives the same for numeric variables, plus the count of non-null values and the mean and standard deviation. For example:

import numpy as np
from pandas import Series
s1 = Series(np.random.randn(100))
print(s1.describe())

outputs what we’d expect from a series of randomly-generated standard normals:

count 100.000000
mean -0.149274 ...

Machine Learning for Hackers Chapter 1, Part 2: Cleaning date and location data

Introduction

In the previous post, I loaded the raw UFO data into a Pandas data frame after cleaning up some irregularities in the text file. Since we’re ultimately concerned with analyzing UFO sightings over time and space, the next step is to clean those variables and prepare them for analysis and visualization.

Some Python techniques to note in this part are:

  • As in the last part, Python string methods are going to come in really handy, offering a simple, expressive solution to a lot of problems.
  • When those aren’t enough, Python has a pretty straightforward set of functions for implementing regular expressions.
  • The map() method in Pandas can be used to “vectorize” functions along a Series (i.e. a data frame column) and is similar to R’s apply. In general, using a NumPy ufunc (vectorized function) is preferable, but not all operations can be expressed in ufuncs. This is especially true for non-numeric operations, such as for strings or dates.
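
The three techniques above can be combined in a small sketch. It assumes dates stored as eight-digit YYYYMMDD integers; the sample values and the helper name to_date are hypothetical, not taken from the post's code.

```python
import re
from datetime import datetime

import pandas as pd

# Hypothetical raw values: integers that should read as YYYYMMDD.
raw = pd.Series([19951009, 19950101, 0])

eight_digits = re.compile(r"^\d{8}$")


def to_date(value):
    """Return a datetime for a well-formed YYYYMMDD value, else None."""
    text = str(value).strip()         # string methods handle basic cleanup
    if not eight_digits.match(text):  # the regex screens out malformed entries
        return None
    try:
        return datetime.strptime(text, "%Y%m%d")
    except ValueError:                # eight digits, but not a real calendar date
        return None


# map() applies the function to each element of the Series.
parsed = raw.map(to_date)
print(parsed)
```

Entries that fail the check come back as missing values, so they can be dropped or inspected afterwards.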

Cleaning dates: mapping and subsetting.

The first two columns of the data are dates in YYYYMMDD format, and Pandas imported them as integers. R has a function, as.Date, that will operate on a vector ...

Machine Learning for Hackers Chapter 1, Part 1: Loading data

Preface

This is my first Will it Python? post. These posts document my experiences trying to port complete and interesting R projects to Python. I’m beginning by going through the recently published Machine Learning for Hackers (MLFH) by Drew Conway and John Myles White.

More information on the posts is here, and archives are here.

Introduction

The first chapter of MLFH is a gentle introduction to loading, manipulating and graphing data in R. To keep the tutorial interesting, the authors have found a fun dataset of UFO sightings to work through.

Since this chapter is mainly devoted to loading and manipulating data, a lot of the R functionality they exploit is going to have an analog in Pandas. Even though there’s not too much exciting going on in this chapter, it’s a great way to explore how basic data tasks get done in Python. It turns out there are some interesting differences between how R and Python handle even this simple stuff.

In this first post, I’ll focus on just getting the data into the work environment. The complete code for the chapter is located in a Github repo, here.

Data with inconsistent column lengths: break ...
