Machine Learning for Hackers Chapter 1, Part 2: Cleaning date and location data

Introduction

In the previous post, I loaded the raw UFO data into a Pandas data frame after cleaning up some irregularities in the text file. Since we’re ultimately concerned with analyzing UFO sightings over time and space, the next step is to clean those variables and prepare them for analysis and visualization.

Some Python techniques to note in this part are:

  • As in the last part, Python string methods come in really handy and give a simple, expressive solution to a lot of problems.
  • When those aren’t enough, Python has a pretty straightforward set of functions for working with regular expressions.
  • The map() method in Pandas can be used to “vectorize” functions along a Series (i.e. a data frame column) and is similar to R’s apply. In general, using a NumPy ufunc (vectorized function) is preferable, but not all operations can be expressed in ufuncs. This is especially true for non-numeric operations, such as for strings or dates.
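As a rough sketch of how these three ideas fit together: the snippet below combines a string method, a regular expression, and map() on a Series, then contrasts that with a NumPy ufunc on numeric data. The sample values, the "City, ST" pattern, and the split_location helper are all made up for illustration and are not taken from the UFO data or from the book's code.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical location strings, invented for this example.
locations = pd.Series(["Iowa City, IA", " Milwaukee, WI", "Unknown"])

# Assumed pattern: a city name, a comma, then a two-letter state code.
CITY_STATE = re.compile(r"^(?P<city>[^,]+),\s*(?P<state>[A-Za-z]{2})$")


def split_location(text):
    """Return (city, state), or (None, None) when the text does not match."""
    match = CITY_STATE.match(text.strip())  # strip() is a plain string method
    if match is None:
        return (None, None)
    return (match.group("city").strip(), match.group("state").upper())


# map() applies the Python function to each element of the Series.
parsed = locations.map(split_location)

# For numeric data a ufunc does the same kind of element-wise work
# without a Python-level function call per element.
durations = pd.Series([1.0, 10.0, 100.0])
log_durations = np.log10(durations)
```

The string case has no ufunc equivalent here, so map() is the natural fit; the numeric case goes straight through np.log10.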

Cleaning dates: mapping and subsetting.

The first two columns of the data are dates in YYYYMMDD format, and Pandas imported them as integers. R has a function, as.Date, that will operate on a vector ...
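To make the mapping and subsetting idea concrete, here is a minimal sketch that assumes the dates arrive as eight-digit integers such as 19951009. The sample values and the int_to_date helper are hypothetical, and this is not necessarily how the rest of the post handles the conversion.

```python
from datetime import datetime

import pandas as pd

# Hypothetical date integers; the last one is deliberately malformed.
date_ints = pd.Series([19951009, 19950101, 0])


def int_to_date(value):
    """Parse an integer like 19951009 as a date; return None if it is malformed."""
    try:
        return datetime.strptime(str(value), "%Y%m%d")
    except ValueError:
        return None


# Mapping: apply the conversion to every element of the Series.
dates = date_ints.map(int_to_date)

# Subsetting: keep only the entries that parsed successfully.
good_dates = dates[dates.notnull()]
```

Returning None for bad values, rather than raising, lets the conversion run over the whole column in one pass and leaves the filtering as a separate step.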

Machine Learning for Hackers Chapter 1, Part 1: Loading data

Preface

This is my first Will it Python? post. These posts document my experiences trying to port complete and interesting R projects to Python. I’m beginning by going through the recently published Machine Learning for Hackers (MLFH) by Drew Conway and John Myles White.

More information on the posts is here, and archives are here.

Introduction

The first chapter of MLFH is a gentle introduction to loading, manipulating and graphing data in R. To keep the tutorial interesting, the authors have found a fun dataset of UFO sightings to work through.

Since this chapter is mainly devoted to loading and manipulating data, a lot of the R functionality they exploit is going to have an analog in Pandas. Even though there’s not too much exciting going on in this chapter, it’s a great way to explore how basic data tasks get done in Python. It turns out there are some interesting differences between how R and Python handle even this simple stuff.

In this first post, I’ll focus on just getting the data into the work environment. The complete code for the chapter is located in a GitHub repo, here.

Data with inconsistent column lengths: break ...
